master
1
2
3
4
5
6
7Internet Engineering Task Force (IETF) Y. Collet
8Request for Comments: 8478 M. Kucherawy, Ed.
9Category: Informational Facebook
10ISSN: 2070-1721 October 2018
11
12
13 Zstandard Compression and the application/zstd Media Type
14
15Abstract
16
17 Zstandard, or "zstd" (pronounced "zee standard"), is a data
18 compression mechanism. This document describes the mechanism and
19 registers a media type and content encoding to be used when
20 transporting zstd-compressed content via Multipurpose Internet Mail
21 Extensions (MIME).
22
23 Despite use of the word "standard" as part of its name, readers are
24 advised that this document is not an Internet Standards Track
25 specification; it is being published for informational purposes only.
26
27Status of This Memo
28
29 This document is not an Internet Standards Track specification; it is
30 published for informational purposes.
31
32 This document is a product of the Internet Engineering Task Force
33 (IETF). It represents the consensus of the IETF community. It has
34 received public review and has been approved for publication by the
35 Internet Engineering Steering Group (IESG). Not all documents
36 approved by the IESG are candidates for any level of Internet
37 Standard; see Section 2 of RFC 7841.
38
39 Information about the current status of this document, any errata,
40 and how to provide feedback on it may be obtained at
41 https://www.rfc-editor.org/info/rfc8478.
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58Collet & Kucherawy Informational [Page 1]
59
60RFC 8478 application/zstd October 2018
61
62
63Copyright Notice
64
65 Copyright (c) 2018 IETF Trust and the persons identified as the
66 document authors. All rights reserved.
67
68 This document is subject to BCP 78 and the IETF Trust's Legal
69 Provisions Relating to IETF Documents
70 (https://trustee.ietf.org/license-info) in effect on the date of
71 publication of this document. Please review these documents
72 carefully, as they describe your rights and restrictions with respect
73 to this document. Code Components extracted from this document must
74 include Simplified BSD License text as described in Section 4.e of
75 the Trust Legal Provisions and are provided without warranty as
76 described in the Simplified BSD License.
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114Collet & Kucherawy Informational [Page 2]
115
116RFC 8478 application/zstd October 2018
117
118
119Table of Contents
120
121 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4
122 2. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4
123 3. Compression Algorithm . . . . . . . . . . . . . . . . . . . . 5
124 3.1. Frames . . . . . . . . . . . . . . . . . . . . . . . . . 6
125 3.1.1. Zstandard Frames . . . . . . . . . . . . . . . . . . 6
126 3.1.1.1. Frame Header . . . . . . . . . . . . . . . . . . 7
127 3.1.1.2. Blocks . . . . . . . . . . . . . . . . . . . . . 12
128 3.1.1.3. Compressed Blocks . . . . . . . . . . . . . . . . 14
129 3.1.1.4. Sequence Execution . . . . . . . . . . . . . . . 28
130 3.1.1.5. Repeat Offsets . . . . . . . . . . . . . . . . . 29
131 3.1.2. Skippable Frames . . . . . . . . . . . . . . . . . . 30
132 4. Entropy Encoding . . . . . . . . . . . . . . . . . . . . . . 30
133 4.1. FSE . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
134 4.1.1. FSE Table Description . . . . . . . . . . . . . . . . 31
135 4.2. Huffman Coding . . . . . . . . . . . . . . . . . . . . . 34
136 4.2.1. Huffman Tree Description . . . . . . . . . . . . . . 35
137 4.2.1.1. Huffman Tree Header . . . . . . . . . . . . . . . 36
138 4.2.1.2. FSE Compression of Huffman Weights . . . . . . . 37
139 4.2.1.3. Conversion from Weights to Huffman Prefix Codes . 38
140 4.2.2. Huffman-Coded Streams . . . . . . . . . . . . . . . . 39
141 5. Dictionary Format . . . . . . . . . . . . . . . . . . . . . . 40
142 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 42
143 6.1. The 'application/zstd' Media Type . . . . . . . . . . . . 42
144 6.2. Content Encoding . . . . . . . . . . . . . . . . . . . . 43
145 6.3. Dictionaries . . . . . . . . . . . . . . . . . . . . . . 43
146 7. Security Considerations . . . . . . . . . . . . . . . . . . . 43
147 8. Implementation Status . . . . . . . . . . . . . . . . . . . . 44
148 9. References . . . . . . . . . . . . . . . . . . . . . . . . . 45
149 9.1. Normative References . . . . . . . . . . . . . . . . . . 45
150 9.2. Informative References . . . . . . . . . . . . . . . . . 45
151 Appendix A. Decoding Tables for Predefined Codes . . . . . . . . 46
152 A.1. Literal Length Code Table . . . . . . . . . . . . . . . . 46
153 A.2. Match Length Code Table . . . . . . . . . . . . . . . . . 49
154 A.3. Offset Code Table . . . . . . . . . . . . . . . . . . . . 52
155 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 53
156 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 54
157
158
159
160
161
162
163
164
165
166
167
168
169
170Collet & Kucherawy Informational [Page 3]
171
172RFC 8478 application/zstd October 2018
173
174
1751. Introduction
176
177 Zstandard, or "zstd" (pronounced "zee standard"), is a data
178 compression mechanism, akin to gzip [RFC1952].
179
180 Despite use of the word "standard" as part of its name, readers are
181 advised that this document is not an Internet Standards Track
182 specification; it is being published for informational purposes only.
183
184 This document describes the Zstandard format. Also, to enable the
185 transport of a data object compressed with Zstandard, this document
186 registers a media type that can be used to identify such content when
187 it is used in a payload encoded using Multipurpose Internet Mail
188 Extensions (MIME).
189
1902. Definitions
191
192 Some terms used elsewhere in this document are defined here for
193 clarity.
194
195 uncompressed: Describes an arbitrary set of bytes in their original
196 form, prior to being subjected to compression.
197
198 compress, compression: The act of processing a set of bytes via the
199 compression mechanism described here.
200
201 compressed: Describes the result of passing a set of bytes through
202 this mechanism. The original input has thus been compressed.
203
204 decompress, decompression: The act of processing a set of bytes
205 through the inverse of the compression mechanism described here,
206 in an attempt to recover the original set of bytes prior to
207 compression.
208
209 decompressed: Describes the result of passing a set of bytes through
210 the reverse of this mechanism. When this is successful, the
211 decompressed payload and the uncompressed payload are
212 indistinguishable.
213
214 encode: The process of translating data from one form to another;
215 this may include compression or it may refer to other translations
216 done as part of this specification.
217
218 decode: The reverse of "encode"; describes a process of reversing a
219 prior encoding to recover the original content.
220
221
222
223
224
225
226Collet & Kucherawy Informational [Page 4]
227
228RFC 8478 application/zstd October 2018
229
230
231 frame: Content compressed by Zstandard is transformed into a
232 Zstandard frame. Multiple frames can be appended into a single
233 file or stream. A frame is completely independent, has a defined
234 beginning and end, and has a set of parameters that tells the
235 decoder how to decompress it.
236
237 block: A frame encapsulates one or multiple blocks. Each block
238 contains arbitrary content, which is described by its header, and
239 has a guaranteed maximum content size that depends upon frame
240 parameters. Unlike frames, each block depends on previous blocks
241 for proper decoding. However, each block can be decompressed
242 without waiting for its successor, allowing streaming operations.
243
244 natural order: A sequence or ordering of objects or values that is
245 typical of that type of object or value. A set of unique
246 integers, for example, is in "natural order" if when progressing
247 from one element in the set or sequence to the next, there is
248 never a decrease in value.
249
250 The naming convention for identifiers within the specification is
251 Mixed_Case_With_Underscores. Identifiers inside square brackets
252 indicate that the identifier is optional in the presented context.
253
2543. Compression Algorithm
255
256 This section describes the Zstandard algorithm.
257
258 The purpose of this document is to define a lossless compressed data
259 format that is a) independent of the CPU type, operating system, file
260 system, and character set and b) is suitable for file compression and
261 pipe and streaming compression, using the Zstandard algorithm. The
262 text of the specification assumes a basic background in programming
263 at the level of bits and other primitive data representations.
264
265 The data can be produced or consumed, even for an arbitrarily long
266 sequentially presented input data stream, using only an a priori
267 bounded amount of intermediate storage, and hence can be used in data
268 communications. The format uses the Zstandard compression method,
269 and an optional xxHash-64 checksum method [XXHASH], for detection of
270 data corruption.
271
272 The data format defined by this specification does not attempt to
273 allow random access to compressed data.
274
275 Unless otherwise indicated below, a compliant compressor must produce
276 data sets that conform to the specifications presented here.
277 However, it does not need to support all options.
278
279
280
281
282Collet & Kucherawy Informational [Page 5]
283
284RFC 8478 application/zstd October 2018
285
286
287 A compliant decompressor must be able to decompress at least one
288 working set of parameters that conforms to the specifications
289 presented here. It may also ignore informative fields, such as the
290 checksum. Whenever it does not support a parameter defined in the
291 compressed stream, it must produce a non-ambiguous error code and
292 associated error message explaining which parameter is unsupported.
293
294 This specification is intended for use by implementers of software to
295 compress data into Zstandard format and/or decompress data from
296 Zstandard format. The Zstandard format is supported by an open
297 source reference implementation, written in portable C, and available
298 at [ZSTD].
299
3003.1. Frames
301
302 Zstandard compressed data is made up of one or more frames. Each
303 frame is independent and can be decompressed independently of other
304 frames. The decompressed content of multiple concatenated frames is
305 the concatenation of each frame's decompressed content.
306
307 There are two frame formats defined for Zstandard: Zstandard frames
308 and skippable frames. Zstandard frames contain compressed data,
309 while skippable frames contain custom user metadata.
310
3113.1.1. Zstandard Frames
312
313 The structure of a single Zstandard frame is as follows:
314
315 +--------------------+------------+
316 | Magic_Number | 4 bytes |
317 +--------------------+------------+
318 | Frame_Header | 2-14 bytes |
319 +--------------------+------------+
320 | Data_Block | n bytes |
321 +--------------------+------------+
322 | [More Data_Blocks] | |
323 +--------------------+------------+
324 | [Content_Checksum] | 0-4 bytes |
325 +--------------------+------------+
326
327 Magic_Number: 4 bytes, little-endian format. Value: 0xFD2FB528.
328
329 Frame_Header: 2 to 14 bytes, detailed in Section 3.1.1.1.
330
331 Data_Block: Detailed in Section 3.1.1.2. This is where data
332 appears.
333
334
335
336
337
338Collet & Kucherawy Informational [Page 6]
339
340RFC 8478 application/zstd October 2018
341
342
343 Content_Checksum: An optional 32-bit checksum, only present if
344 Content_Checksum_Flag is set. The content checksum is the result
345 of the XXH64() hash function [XXHASH] digesting the original
346 (decoded) data as input, and a seed of zero. The low 4 bytes of
347 the checksum are stored in little-endian format.
348
349 The magic number was selected to be less probable to find at the
350 beginning of an arbitrary file. It avoids trivial patterns (0x00,
351 0xFF, repeated bytes, increasing bytes, etc.), contains byte values
352 outside of ASCII range, and doesn't map into UTF-8 space, all of
353 which reduce the likelihood of its appearance at the top of a text
354 file.
355
3563.1.1.1. Frame Header
357
358 The frame header has a variable size, with a minimum of 2 bytes and
359 up to 14 bytes depending on optional parameters. The structure of
360 Frame_Header is as follows:
361
362 +-------------------------+-----------+
363 | Frame_Header_Descriptor | 1 byte |
364 +-------------------------+-----------+
365 | [Window_Descriptor] | 0-1 byte |
366 +-------------------------+-----------+
367 | [Dictionary_ID] | 0-4 bytes |
368 +-------------------------+-----------+
369 | [Frame_Content_Size] | 0-8 bytes |
370 +-------------------------+-----------+
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394Collet & Kucherawy Informational [Page 7]
395
396RFC 8478 application/zstd October 2018
397
398
3993.1.1.1.1. Frame_Header_Descriptor
400
401 The first header's byte is called the Frame_Header_Descriptor. It
402 describes which other fields are present. Decoding this byte is
403 enough to tell the size of Frame_Header.
404
405 +------------+-------------------------+
406 | Bit Number | Field Name |
407 +------------+-------------------------+
408 | 7-6 | Frame_Content_Size_Flag |
409 +------------+-------------------------+
410 | 5 | Single_Segment_Flag |
411 +------------+-------------------------+
412 | 4 | (unused) |
413 +------------+-------------------------+
414 | 3 | (reserved) |
415 +------------+-------------------------+
416 | 2 | Content_Checksum_Flag |
417 +------------+-------------------------+
418 | 1-0 | Dictionary_ID_Flag |
419 +------------+-------------------------+
420
421 In this table, bit 7 is the highest bit, while bit 0 is the lowest
422 one.
423
4243.1.1.1.1.1. Frame_Content_Size_Flag
425
426 This is a 2-bit flag (equivalent to Frame_Header_Descriptor right-
427 shifted 6 bits) specifying whether Frame_Content_Size (the
428 decompressed data size) is provided within the header. Flag_Value
429 provides FCS_Field_Size, which is the number of bytes used by
430 Frame_Content_Size according to the following table:
431
432 +----------------+--------+---+---+---+
433 | Flag_Value | 0 | 1 | 2 | 3 |
434 +----------------+--------+---+---+---+
435 | FCS_Field_Size | 0 or 1 | 2 | 4 | 8 |
436 +----------------+--------+---+---+---+
437
438 When Flag_Value is 0, FCS_Field_Size depends on Single_Segment_Flag:
439 If Single_Segment_Flag is set, FCS_Field_Size is 1. Otherwise,
440 FCS_Field_Size is 0; Frame_Content_Size is not provided.
441
442
443
444
445
446
447
448
449
450Collet & Kucherawy Informational [Page 8]
451
452RFC 8478 application/zstd October 2018
453
454
4553.1.1.1.1.2. Single_Segment_Flag
456
457 If this flag is set, data must be regenerated within a single
458 continuous memory segment.
459
460 In this case, Window_Descriptor byte is skipped, but
461 Frame_Content_Size is necessarily present. As a consequence, the
462 decoder must allocate a memory segment of size equal or larger than
463 Frame_Content_Size.
464
465 In order to protect the decoder from unreasonable memory
466 requirements, a decoder is allowed to reject a compressed frame that
467 requests a memory size beyond the decoder's authorized range.
468
469 For broader compatibility, decoders are recommended to support memory
470 sizes of at least 8 MB. This is only a recommendation; each decoder
471 is free to support higher or lower limits, depending on local
472 limitations.
473
4743.1.1.1.1.3. Unused Bit
475
476 A decoder compliant with this specification version shall not
477 interpret this bit. It might be used in a future version, to signal
478 a property that is not mandatory to properly decode the frame. An
479 encoder compliant with this specification must set this bit to zero.
480
4813.1.1.1.1.4. Reserved Bit
482
483 This bit is reserved for some future feature. Its value must be
484 zero. A decoder compliant with this specification version must
485 ensure it is not set. This bit may be used in a future revision, to
486 signal a feature that must be interpreted to decode the frame
487 correctly.
488
4893.1.1.1.1.5. Content_Checksum_Flag
490
491 If this flag is set, a 32-bit Content_Checksum will be present at the
492 frame's end. See the description of Content_Checksum above.
493
494
495
496
497
498
499
500
501
502
503
504
505
506Collet & Kucherawy Informational [Page 9]
507
508RFC 8478 application/zstd October 2018
509
510
5113.1.1.1.1.6. Dictionary_ID_Flag
512
513 This is a 2-bit flag (= Frame_Header_Descriptor & 0x3) indicating
514 whether a dictionary ID is provided within the header. It also
515 specifies the size of this field as DID_Field_Size:
516
517 +----------------+---+---+---+---+
518 | Flag_Value | 0 | 1 | 2 | 3 |
519 +----------------+---+---+---+---+
520 | DID_Field_Size | 0 | 1 | 2 | 4 |
521 +----------------+---+---+---+---+
522
5233.1.1.1.2. Window Descriptor
524
525 This provides guarantees about the minimum memory buffer required to
526 decompress a frame. This information is important for decoders to
527 allocate enough memory.
528
529 The Window_Descriptor byte is optional. When Single_Segment_Flag is
530 set, Window_Descriptor is not present. In this case, Window_Size is
531 Frame_Content_Size, which can be any value from 0 to 2^64-1 bytes (16
532 ExaBytes).
533
534 +------------+----------+----------+
535 | Bit Number | 7-3 | 2-0 |
536 +------------+----------+----------+
537 | Field Name | Exponent | Mantissa |
538 +------------+----------+----------+
539
540 The minimum memory buffer size is called Window_Size. It is
541 described by the following formulae:
542
543 windowLog = 10 + Exponent;
544 windowBase = 1 << windowLog;
545 windowAdd = (windowBase / 8) * Mantissa;
546 Window_Size = windowBase + windowAdd;
547
548 The minimum Window_Size is 1 KB. The maximum Window_Size is (1<<41)
549 + 7*(1<<38) bytes, which is 3.75 TB.
550
551 In general, larger Window_Size values tend to improve the compression
552 ratio, but at the cost of increased memory usage.
553
554 To properly decode compressed data, a decoder will need to allocate a
555 buffer of at least Window_Size bytes.
556
557
558
559
560
561
562Collet & Kucherawy Informational [Page 10]
563
564RFC 8478 application/zstd October 2018
565
566
567 In order to protect decoders from unreasonable memory requirements, a
568 decoder is allowed to reject a compressed frame that requests a
569 memory size beyond decoder's authorized range.
570
571 For improved interoperability, it's recommended for decoders to
572 support values of Window_Size up to 8 MB and for encoders not to
573 generate frames requiring a Window_Size larger than 8 MB. It's
574 merely a recommendation though, and decoders are free to support
575 larger or lower limits, depending on local limitations.
576
5773.1.1.1.3. Dictionary_ID
578
579 This is a variable size field, which contains the ID of the
580 dictionary required to properly decode the frame. This field is
581 optional. When it's not present, it's up to the decoder to know
582 which dictionary to use.
583
584 Dictionary_ID field size is provided by DID_Field_Size.
585 DID_Field_Size is directly derived from the value of
586 Dictionary_ID_Flag. One byte can represent an ID 0-255; 2 bytes can
587 represent an ID 0-65535; 4 bytes can represent an ID 0-4294967295.
588 Format is little-endian.
589
590 It is permitted to represent a small ID (for example, 13) with a
591 large 4-byte dictionary ID, even if it is less efficient.
592
593 Within private environments, any dictionary ID can be used. However,
594 for frames and dictionaries distributed in public space,
595 Dictionary_ID must be attributed carefully. The following ranges are
596 reserved for use only with dictionaries that have been registered
597 with IANA (see Section 6.3):
598
599 low range: <= 32767
600 high range: >= (1 << 31)
601
602 Any other value for Dictionary_ID can be used by private arrangement
603 between participants.
604
605 Any payload presented for decompression that references an
606 unregistered reserved dictionary ID results in an error.
607
608
609
610
611
612
613
614
615
616
617
618Collet & Kucherawy Informational [Page 11]
619
620RFC 8478 application/zstd October 2018
621
622
6233.1.1.1.4. Frame Content Size
624
625 This is the original (uncompressed) size. This information is
626 optional. Frame_Content_Size uses a variable number of bytes,
627 provided by FCS_Field_Size. FCS_Field_Size is provided by the value
628 of Frame_Content_Size_Flag. FCS_Field_Size can be equal to 0 (not
629 present), 1, 2, 4, or 8 bytes.
630
631 +----------------+--------------+
632 | FCS Field Size | Range |
633 +----------------+--------------+
634 | 0 | unknown |
635 +----------------+--------------+
636 | 1 | 0 - 255 |
637 +----------------+--------------+
638 | 2 | 256 - 65791 |
639 +----------------+--------------+
640 | 4 | 0 - 2^32 - 1 |
641 +----------------+--------------+
642 | 8 | 0 - 2^64 - 1 |
643 +----------------+--------------+
644
645 Frame_Content_Size format is little-endian. When FCS_Field_Size is
646 1, 4, or 8 bytes, the value is read directly. When FCS_Field_Size is
647 2, the offset of 256 is added. It's allowed to represent a small
648 size (for example 18) using any compatible variant.
649
6503.1.1.2. Blocks
651
652 After Magic_Number and Frame_Header, there are some number of blocks.
653 Each frame must have at least 1 block, but there is no upper limit on
654 the number of blocks per frame.
655
656 The structure of a block is as follows:
657
658 +--------------+---------------+
659 | Block_Header | Block_Content |
660 +--------------+---------------+
661 | 3 bytes | n bytes |
662 +--------------+---------------+
663
664
665
666
667
668
669
670
671
672
673
674Collet & Kucherawy Informational [Page 12]
675
676RFC 8478 application/zstd October 2018
677
678
679 Block_Header uses 3 bytes, written using little-endian convention.
680 It contains three fields:
681
682 +------------+------------+------------+
683 | Last_Block | Block_Type | Block_Size |
684 +------------+------------+------------+
685 | bit 0 | bits 1-2 | bits 3-23 |
686 +------------+------------+------------+
687
6883.1.1.2.1. Last_Block
689
690 The lowest bit (Last_Block) signals whether this block is the last
691 one. The frame will end after this last block. It may be followed
692 by an optional Content_Checksum (see Section 3.1.1).
693
6943.1.1.2.2. Block_Type
695
696 The next 2 bits represent the Block_Type. There are four block
697 types:
698
699 +-----------+------------------+
700 | Value | Block_Type |
701 +-----------+------------------+
702 | 0 | Raw_Block |
703 +-----------+------------------+
704 | 1 | RLE_Block |
705 +-----------+------------------+
706 | 2 | Compressed_Block |
707 +-----------+------------------+
708 | 3 | Reserved |
709 +-----------+------------------+
710
711 Raw_Block: This is an uncompressed block. Block_Content contains
712 Block_Size bytes.
713
714 RLE_Block: This is a single byte, repeated Block_Size times.
715 Block_Content consists of a single byte. On the decompression
716 side, this byte must be repeated Block_Size times.
717
718 Compressed_Block: This is a compressed block as described in
719 Section 3.1.1.3. Block_Size is the length of Block_Content,
720 namely the compressed data. The decompressed size is not known,
721 but its maximum possible value is guaranteed (see below).
722
723 Reserved: This is not a block. This value cannot be used with the
724 current specification. If such a value is present, it is
725 considered to be corrupt data.
726
727
728
729
730Collet & Kucherawy Informational [Page 13]
731
732RFC 8478 application/zstd October 2018
733
734
7353.1.1.2.3. Block_Size
736
737 The upper 21 bits of Block_Header represent the Block_Size.
738 Block_Size is the size of the block excluding the header. A block
739 can contain any number of bytes (even zero), up to
740 Block_Maximum_Decompressed_Size, which is the smallest of:
741
742 o Window_Size
743
744 o 128 KB
745
746 A Compressed_Block has the extra restriction that Block_Size is
747 always strictly less than the decompressed size. If this condition
748 cannot be respected, the block must be sent uncompressed instead
749 (i.e., treated as a Raw_Block).
750
7513.1.1.3. Compressed Blocks
752
753 To decompress a compressed block, the compressed size must be
754 provided from the Block_Size field within Block_Header.
755
756 A compressed block consists of two sections: a Literals
757 Section (Section 3.1.1.3.1) and a
758 Sequences_Section (Section 3.1.1.3.2). The results of the two
759 sections are then combined to produce the decompressed data in
760 Sequence Execution (Section 3.1.1.4).
761
762 To decode a compressed block, the following elements are necessary:
763
764 o Previous decoded data, up to a distance of Window_Size, or the
765 beginning of the Frame, whichever is smaller. Single_Segment_Flag
766 will be set in the latter case.
767
768 o List of "recent offsets" from the previous Compressed_Block.
769
770 o The previous Huffman tree, required by Treeless_Literals_Block
771 type.
772
773 o Previous Finite State Entropy (FSE) decoding tables, required by
774 Repeat_Mode, for each symbol type (literals lengths, match
775 lengths, offsets).
776
777 Note that decoding tables are not always from the previous
778 Compressed_Block:
779
780 o Every decoding table can come from a dictionary.
781
782
783
784
785
786Collet & Kucherawy Informational [Page 14]
787
788RFC 8478 application/zstd October 2018
789
790
791 o The Huffman tree comes from the previous
792 Compressed_Literals_Block.
793
7943.1.1.3.1. Literals_Section_Header
795
796 All literals are regrouped in the first part of the block. They can
797 be decoded first and then copied during Sequence Execution (see
798 Section 3.1.1.4), or they can be decoded on the flow during Sequence
799 Execution.
800
801 Literals can be stored uncompressed or compressed using Huffman
802 prefix codes. When compressed, an optional tree description can be
803 present, followed by 1 or 4 streams.
804
805 +----------------------------+
806 | Literals_Section_Header |
807 +----------------------------+
808 | [Huffman_Tree_Description] |
809 +----------------------------+
810 | [Jump_Table] |
811 +----------------------------+
812 | Stream_1 |
813 +----------------------------+
814 | [Stream_2] |
815 +----------------------------+
816 | [Stream_3] |
817 +----------------------------+
818 | [Stream_4] |
819 +----------------------------+
820
8213.1.1.3.1.1. Literals_Section_Header
822
823 This field describes how literals are packed. It's a byte-aligned
824 variable-size bit field, ranging from 1 to 5 bytes, using little-
825 endian convention.
826
827 +---------------------+-----------+
828 | Literals_Block_Type | 2 bits |
829 +---------------------+-----------+
830 | Size_Format | 1-2 bits |
831 +---------------------+-----------+
832 | Regenerated_Size | 5-20 bits |
833 +---------------------+-----------+
834 | [Compressed_Size] | 0-18 bits |
835 +---------------------+-----------+
836
837 In this representation, bits at the top are the lowest bits.
838
839
840
841
842Collet & Kucherawy Informational [Page 15]
843
844RFC 8478 application/zstd October 2018
845
846
847 The Literals_Block_Type field uses the two lowest bits of the first
848 byte, describing four different block types:
849
850 +---------------------------+-------+
851 | Literals_Block_Type | Value |
852 +---------------------------+-------+
853 | Raw_Literals_Block | 0 |
854 +---------------------------+-------+
855 | RLE_Literals_Block | 1 |
856 +---------------------------+-------+
857 | Compressed_Literals_Block | 2 |
858 +---------------------------+-------+
859 | Treeless_Literals_Block | 3 |
860 +---------------------------+-------+
861
862 Raw_Literals_Block: Literals are stored uncompressed.
863 Literals_Section_Content is Regenerated_Size.
864
865 RLE_Literals_Block: Literals consist of a single-byte value repeated
866 Regenerated_Size times. Literals_Section_Content is 1.
867
868 Compressed_Literals_Block: This is a standard Huffman-compressed
869 block, starting with a Huffman tree description. See details
870 below. Literals_Section_Content is Compressed_Size.
871
872 Treeless_Literals_Block: This is a Huffman-compressed block, using
873 the Huffman tree from the previous Compressed_Literals_Block, or a
874 dictionary if there is no previous Huffman-compressed literals
875 block. Huffman_Tree_Description will be skipped. Note that if
876 this mode is triggered without any previous Huffman-table in the
877 frame (or dictionary, per Section 5), it should be treated as data
878 corruption. Literals_Section_Content is Compressed_Size.
879
880 The Size_Format is divided into two families:
881
882 o For Raw_Literals_Block and RLE_Literals_Block, it's only necessary
883 to decode Regenerated_Size. There is no Compressed_Size field.
884
885 o For Compressed_Block and Treeless_Literals_Block, it's required to
886 decode both Compressed_Size and Regenerated_Size (the decompressed
887 size). It's also necessary to decode the number of streams (1 or
888 4).
889
890 For values spanning several bytes, the convention is little endian.
891
892 Size_Format for Raw_Literals_Block and RLE_Literals_Block uses 1 or 2
893 bits. Its value is (Literals_Section_Header[0]>>2) & 0x3.
894
895
896
897
898Collet & Kucherawy Informational [Page 16]
899
900RFC 8478 application/zstd October 2018
901
902
903 Size_Format == 00 or 10: Size_Format uses 1 bit. Regenerated_Size
904 uses 5 bits (value 0-31). Literals_Section_Header uses 1 byte.
905 Regenerated_Size = Literal_Section_Header[0]>>3.
906
907 Size_Format == 01: Size_Format uses 2 bits. Regenerated_Size uses
908 12 bits (values 0-4095). Literals_Section_Header uses 2 bytes.
909 Regenerated_Size = (Literals_Section_Header[0]>>4) +
910 (Literals_Section_Header[1]<<4).
911
912 Size_Format == 11: Size_Format uses 2 bits. Regenerated_Size uses
913 20 bits (values 0-1048575). Literals_Section_Header uses 3 bytes.
914 Regenerated_Size = (Literals_Section_Header[0]>>4) +
915 (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)
916
917 Only Stream_1 is present for these cases. Note that it is permitted
918 to represent a short value (for example, 13) using a long format,
919 even if it's less efficient.
920
921 Size_Format for Compressed_Literals_Block and Treeless_Literals_Block
922 always uses 2 bits.
923
924 Size_Format == 00: A single stream. Both Regenerated_Size and
925 Compressed_Size use 10 bits (values 0-1023).
926 Literals_Section_Header uses 3 bytes.
927
928 Size_Format == 01: 4 streams. Both Regenerated_Size and
929 Compressed_Size use 10 bits (values 0-1023).
930 Literals_Section_Header uses 3 bytes.
931
932 Size_Format == 10: 4 streams. Both Regenerated_Size and
933 Compressed_Size use 14 bits (values 0-16383).
934 Literals_Section_Header uses 4 bytes.
935
936 Size_Format == 11: 4 streams. Both Regenerated_Size and
937 Compressed_Size use 18 bits (values 0-262143).
938 Literals_Section_Header uses 5 bytes.
939
940 Both the Compressed_Size and Regenerated_Size fields follow little-
941 endian convention. Note that Compressed_Size includes the size of
942 the Huffman_Tree_Description when it is present.
943
9443.1.1.3.1.2. Raw_Literals_Block
945
946 The data in Stream_1 is Regenerated_Size bytes long. It contains the
947 raw literals data to be used during Sequence Execution
948 (Section 3.1.1.3.2).
949
950
951
952
953
954Collet & Kucherawy Informational [Page 17]
955
956RFC 8478 application/zstd October 2018
957
958
9593.1.1.3.1.3. RLE_Literals_Block
960
961 Stream_1 consists of a single byte that should be repeated
962 Regenerated_Size times to generate the decoded literals.
963
9643.1.1.3.1.4. Compressed_Literals_Block and Treeless_Literals_Block
965
966 Both of these modes contain Huffman-encoded data. For
967 Treeless_Literals_Block, the Huffman table comes from the previously
968 compressed literals block, or from a dictionary; see Section 5.
969
9703.1.1.3.1.5. Huffman_Tree_Description
971
972 This section is only present when the Literals_Block_Type type is
973 Compressed_Literals_Block (2). The format of
974 Huffman_Tree_Description can be found in Section 4.2.1. The size of
975 Huffman_Tree_Description is determined during the decoding process.
976 It must be used to determine where streams begin.
977
978 Total_Streams_Size = Compressed_Size
979 - Huffman_Tree_Description_Size
980
9813.1.1.3.1.6. Jump_Table
982
983 The Jump_Table is only present when there are 4 Huffman-coded
984 streams.
985
986 (Reminder: Huffman-compressed data consists of either 1 or 4 Huffman-
987 coded streams.)
988
989 If only 1 stream is present, it is a single bitstream occupying the
990 entire remaining portion of the literals block, encoded as described
991 within Section 4.2.2.
992
993 If there are 4 streams, Literals_Section_Header only provides enough
994 information to know the decompressed and compressed sizes of all 4
995 streams combined. The decompressed size of each stream is equal to
996 (Regenerated_Size+3)/4, except for the last stream, which may be up
997 to 3 bytes smaller, to reach a total decompressed size as specified
998 in Regenerated_Size.
999
1000 The compressed size of each stream is provided explicitly in the
1001 Jump_Table. The Jump_Table is 6 bytes long and consists of three
1002 2-byte little-endian fields, describing the compressed sizes of the
1003 first 3 streams. Stream4_Size is computed from Total_Streams_Size
1004 minus sizes of other streams.
1005
1006
1007
1008
1009
1010Collet & Kucherawy Informational [Page 18]
1011
1012RFC 8478 application/zstd October 2018
1013
1014
1015 Stream4_Size = Total_Streams_Size - 6
1016 - Stream1_Size - Stream2_Size
1017 - Stream3_Size
1018
1019 Note that if Stream1_Size + Stream2_Size + Stream3_Size exceeds
1020 Total_Streams_Size, the data are considered corrupted.
1021
1022 Each of these 4 bitstreams is then decoded independently as a
1023 Huffman-Coded stream, as described in Section 4.2.2.
1024
10253.1.1.3.2. Sequences_Section
1026
1027 A compressed block is a succession of sequences. A sequence is a
1028 literal copy command, followed by a match copy command. A literal
1029 copy command specifies a length. It is the number of bytes to be
1030 copied (or extracted) from the Literals Section. A match copy
1031 command specifies an offset and a length.
1032
1033 When all sequences are decoded, if there are literals left in the
1034 literals section, these bytes are added at the end of the block.
1035
1036 This is described in more detail in Section 3.1.1.4.
1037
1038 The Sequences_Section regroups all symbols required to decode
1039 commands. There are three symbol types: literals lengths, offsets,
1040 and match lengths. They are encoded together, interleaved, in a
1041 single "bitstream".
1042
1043 The Sequences_Section starts by a header, followed by optional
1044 probability tables for each symbol type, followed by the bitstream.
1045
1046 Sequences_Section_Header
1047 [Literals_Length_Table]
1048 [Offset_Table]
1049 [Match_Length_Table]
1050 bitStream
1051
1052 To decode the Sequences_Section, it's necessary to know its size.
1053 This size is deduced from the size of the Literals_Section:
1054 Sequences_Section_Size = Block_Size - Literals_Section_Header -
1055 Literals_Section_Content
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066Collet & Kucherawy Informational [Page 19]
1067
1068RFC 8478 application/zstd October 2018
1069
1070
10713.1.1.3.2.1. Sequences_Section_Header
1072
1073 This header consists of two items:
1074
1075 o Number_of_Sequences
1076
1077 o Symbol_Compression_Modes
1078
1079 Number_of_Sequences is a variable size field using between 1 and 3
1080 bytes. If the first byte is "byte0":
1081
1082 o if (byte0 == 0): there are no sequences. The sequence section
1083 stops here. Decompressed content is defined entirely as Literals
1084 Section content. The FSE tables used in Repeat_Mode are not
1085 updated.
1086
1087 o if (byte0 < 128): Number_of_Sequences = byte0. Uses 1 byte.
1088
1089 o if (byte0 < 255): Number_of_Sequences = ((byte0 - 128) << 8) +
1090 byte1. Uses 2 bytes.
1091
1092 o if (byte0 == 255): Number_of_Sequences = byte1 + (byte2 << 8) +
1093 0x7F00. Uses 3 bytes.
1094
1095 Symbol_Compression_Modes is a single byte, defining the compression
1096 mode of each symbol type.
1097
1098 +-------------+----------------------+
1099 | Bit Number | Field Name |
1100 +-------------+----------------------+
1101 | 7-6 | Literal_Lengths_Mode |
1102 +-------------+----------------------+
1103 | 5-4 | Offsets_Mode |
1104 +-------------+----------------------+
1105 | 3-2 | Match_Lengths_Mode |
1106 +-------------+----------------------+
1107 | 1-0 | Reserved |
1108 +-------------+----------------------+
1109
1110 The last field, Reserved, must be all zeroes.
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122Collet & Kucherawy Informational [Page 20]
1123
1124RFC 8478 application/zstd October 2018
1125
1126
1127 Literals_Lengths_Mode, Offsets_Mode, and Match_Lengths_Mode define
1128 the Compression_Mode of literals lengths, offsets, and match lengths
1129 symbols, respectively. They follow the same enumeration:
1130
1131 +-------+---------------------+
1132 | Value | Compression_Mode |
1133 +-------+---------------------+
1134 | 0 | Predefined_Mode |
1135 +-------+---------------------+
1136 | 1 | RLE_Mode |
1137 +-------+---------------------+
1138 | 2 | FSE_Compressed_Mode |
1139 +-------+---------------------+
1140 | 3 | Repeat_Mode |
1141 +-------+---------------------+
1142
1143 Predefined_Mode: A predefined FSE (see Section 4.1) distribution
1144 table is used, as defined in Section 3.1.1.3.2.2. No distribution
1145 table will be present.
1146
1147 RLE_Mode: The table description consists of a single byte, which
1148 contains the symbol's value. This symbol will be used for all
1149 sequences.
1150
1151 FSE_Compressed_Mode: Standard FSE compression. A distribution table
1152 will be present. The format of this distribution table is
1153 described in Section 4.1.1. Note that the maximum allowed
1154 accuracy log for literals length and match length tables is 9, and
1155 the maximum accuracy log for the offsets table is 8. This mode
1156 must not be used when only one symbol is present; RLE_Mode should
1157 be used instead (although any other mode will work).
1158
1159 Repeat_Mode: The table used in the previous Compressed_Block with
1160 Number_Of_Sequences > 0 will be used again, or if this is the
1161 first block, the table in the dictionary will be used. Note that
1162 this includes RLE_Mode, so if Repeat_Mode follows RLE_Mode, the
1163 same symbol will be repeated. It also includes Predefined_Mode,
1164 in which case Repeat_Mode will have the same outcome as
1165 Predefined_Mode. No distribution table will be present. If this
1166 mode is used without any previous sequence table in the frame (or
1167 dictionary; see Section 5) to repeat, this should be treated as
1168 corruption.
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178Collet & Kucherawy Informational [Page 21]
1179
1180RFC 8478 application/zstd October 2018
1181
1182
11833.1.1.3.2.1.1. Sequence Codes for Lengths and Offsets
1184
1185 Each symbol is a code in its own context, which specifies Baseline
1186 and Number_of_Bits to add. Codes are FSE compressed and interleaved
1187 with raw additional bits in the same bitstream.
1188
1189 Literals length codes are values ranging from 0 to 35 inclusive.
1190 They define lengths from 0 to 131071 bytes. The literals length is
1191 equal to the decoded Baseline plus the result of reading
1192 Number_of_Bits bits from the bitstream, as a little-endian value.
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234Collet & Kucherawy Informational [Page 22]
1235
1236RFC 8478 application/zstd October 2018
1237
1238
1239 +----------------------+----------+----------------+
1240 | Literals_Length_Code | Baseline | Number_of_Bits |
1241 +----------------------+----------+----------------+
1242 | 0-15 | length | 0 |
1243 +----------------------+----------+----------------+
1244 | 16 | 16 | 1 |
1245 +----------------------+----------+----------------+
1246 | 17 | 18 | 1 |
1247 +----------------------+----------+----------------+
1248 | 18 | 20 | 1 |
1249 +----------------------+----------+----------------+
1250 | 19 | 22 | 1 |
1251 +----------------------+----------+----------------+
1252 | 20 | 24 | 2 |
1253 +----------------------+----------+----------------+
1254 | 21 | 28 | 2 |
1255 +----------------------+----------+----------------+
1256 | 22 | 32 | 3 |
1257 +----------------------+----------+----------------+
1258 | 23 | 40 | 3 |
1259 +----------------------+----------+----------------+
1260 | 24 | 48 | 4 |
1261 +----------------------+----------+----------------+
1262 | 25 | 64 | 6 |
1263 +----------------------+----------+----------------+
1264 | 26 | 128 | 7 |
1265 +----------------------+----------+----------------+
1266 | 27 | 256 | 8 |
1267 +----------------------+----------+----------------+
1268 | 28 | 512 | 9 |
1269 +----------------------+----------+----------------+
1270 | 29 | 1024 | 10 |
1271 +----------------------+----------+----------------+
1272 | 30 | 2048 | 11 |
1273 +----------------------+----------+----------------+
1274 | 31 | 4096 | 12 |
1275 +----------------------+----------+----------------+
1276 | 32 | 8192 | 13 |
1277 +----------------------+----------+----------------+
1278 | 33 | 16384 | 14 |
1279 +----------------------+----------+----------------+
1280 | 34 | 32768 | 15 |
1281 +----------------------+----------+----------------+
1282 | 35 | 65536 | 16 |
1283 +----------------------+----------+----------------+
1284
1285
1286
1287
1288
1289
1290Collet & Kucherawy Informational [Page 23]
1291
1292RFC 8478 application/zstd October 2018
1293
1294
1295 Match length codes are values ranging from 0 to 52 inclusive. They
1296 define lengths from 3 to 131074 bytes. The match length is equal to
1297 the decoded Baseline plus the result of reading Number_of_Bits bits
1298 from the bitstream, as a little-endian value.
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346Collet & Kucherawy Informational [Page 24]
1347
1348RFC 8478 application/zstd October 2018
1349
1350
1351 +-------------------+-----------------------+----------------+
1352 | Match_Length_Code | Baseline | Number_of_Bits |
1353 +-------------------+-----------------------+----------------+
1354 | 0-31 | Match_Length_Code + 3 | 0 |
1355 +-------------------+-----------------------+----------------+
1356 | 32 | 35 | 1 |
1357 +-------------------+-----------------------+----------------+
1358 | 33 | 37 | 1 |
1359 +-------------------+-----------------------+----------------+
1360 | 34 | 39 | 1 |
1361 +-------------------+-----------------------+----------------+
1362 | 35 | 41 | 1 |
1363 +-------------------+-----------------------+----------------+
1364 | 36 | 43 | 2 |
1365 +-------------------+-----------------------+----------------+
1366 | 37 | 47 | 2 |
1367 +-------------------+-----------------------+----------------+
1368 | 38 | 51 | 3 |
1369 +-------------------+-----------------------+----------------+
1370 | 39 | 59 | 3 |
1371 +-------------------+-----------------------+----------------+
1372 | 40 | 67 | 4 |
1373 +-------------------+-----------------------+----------------+
1374 | 41 | 83 | 4 |
1375 +-------------------+-----------------------+----------------+
1376 | 42 | 99 | 5 |
1377 +-------------------+-----------------------+----------------+
1378 | 43 | 131 | 7 |
1379 +-------------------+-----------------------+----------------+
1380 | 44 | 259 | 8 |
1381 +-------------------+-----------------------+----------------+
1382 | 45 | 515 | 9 |
1383 +-------------------+-----------------------+----------------+
1384 | 46 | 1027 | 10 |
1385 +-------------------+-----------------------+----------------+
1386 | 47 | 2051 | 11 |
1387 +-------------------+-----------------------+----------------+
1388 | 48 | 4099 | 12 |
1389 +-------------------+-----------------------+----------------+
1390 | 49 | 8195 | 13 |
1391 +-------------------+-----------------------+----------------+
1392 | 50 | 16387 | 14 |
1393 +-------------------+-----------------------+----------------+
1394 | 51 | 32771 | 15 |
1395 +-------------------+-----------------------+----------------+
1396 | 52 | 65539 | 16 |
1397 +-------------------+-----------------------+----------------+
1398
1399
1400
1401
1402Collet & Kucherawy Informational [Page 25]
1403
1404RFC 8478 application/zstd October 2018
1405
1406
1407 Offset codes are values ranging from 0 to N.
1408
1409 A decoder is free to limit its maximum supported value for N.
1410 Support for values of at least 22 is recommended. At the time of
1411 this writing, the reference decoder supports a maximum N value of 31.
1412
1413 An offset code is also the number of additional bits to read in
1414 little-endian fashion and can be translated into an Offset_Value
1415 using the following formulas:
1416
1417 Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
1418 if (Offset_Value > 3) Offset = Offset_Value - 3;
1419
1420 This means that maximum Offset_Value is (2^(N+1))-1, supporting back-
1421 reference distance up to (2^(N+1))-4, but it is limited by the
1422 maximum back-reference distance (see Section 3.1.1.1.2).
1423
1424 Offset_Value from 1 to 3 are special: they define "repeat codes".
1425 This is described in more detail in Section 3.1.1.5.
1426
14273.1.1.3.2.1.2. Decoding Sequences
1428
1429 FSE bitstreams are read in reverse of the direction they are written.
1430 In zstd, the compressor writes bits forward into a block, and the
1431 decompressor must read the bitstream backwards.
1432
1433 To find the start of the bitstream, it is therefore necessary to know
1434 the offset of the last byte of the block, which can be found by
1435 counting Block_Size bytes after the block header.
1436
1437 After writing the last bit containing information, the compressor
1438 writes a single 1 bit and then fills the byte with 0-7 zero bits of
1439 padding. The last byte of the compressed bitstream cannot be zero
1440 for that reason.
1441
1442 When decompressing, the last byte containing the padding is the first
1443 byte to read. The decompressor needs to skip 0-7 initial zero bits
1444 until the first 1 bit occurs. Afterwards, the useful part of the
1445 bitstream begins.
1446
1447 FSE decoding requires a 'state' to be carried from symbol to symbol.
1448 For more explanation on FSE decoding, see Section 4.1.
1449
1450 For sequence decoding, a separate state keeps track of each literal
1451 lengths, offsets, and match lengths symbols. Some FSE primitives are
1452 also used. For more details on the operation of these primitives,
1453 see Section 4.1.
1454
1455
1456
1457
1458Collet & Kucherawy Informational [Page 26]
1459
1460RFC 8478 application/zstd October 2018
1461
1462
1463 The bitstream starts with initial FSE state values, each using the
1464 required number of bits in their respective accuracy, decoded
1465 previously from their normalized distribution. It starts with
1466 Literals_Length_State, followed by Offset_State, and finally
1467 Match_Length_State.
1468
1469 Note that all values are read backward, so the 'start' of the
1470 bitstream is at the highest position in memory, immediately before
1471 the last 1 bit for padding.
1472
1473 After decoding the starting states, a single sequence is decoded
1474 Number_Of_Sequences times. These sequences are decoded in order from
1475 first to last. Since the compressor writes the bitstream in the
1476 forward direction, this means the compressor must encode the
1477 sequences starting with the last one and ending with the first.
1478
1479 For each of the symbol types, the FSE state can be used to determine
1480 the appropriate code. The code then defines the Baseline and
1481 Number_of_Bits to read for each type. The description of the codes
1482 for how to determine these values can be found in
1483 Section 3.1.1.3.2.1.
1484
1485 Decoding starts by reading the Number_of_Bits required to decode
1486 offset. It does the same for Match_Length and then for
1487 Literals_Length. This sequence is then used for Sequence Execution
1488 (see Section 3.1.1.4).
1489
1490 If it is not the last sequence in the block, the next operation is to
1491 update states. Using the rules pre-calculated in the decoding
1492 tables, Literals_Length_State is updated, followed by
1493 Match_Length_State, and then Offset_State. See Section 4.1 for
1494 details on how to update states from the bitstream.
1495
1496 This operation will be repeated Number_of_Sequences times. At the
1497 end, the bitstream shall be entirely consumed; otherwise, the
1498 bitstream is considered corrupted.
1499
15003.1.1.3.2.2. Default Distributions
1501
1502 If Predefined_Mode is selected for a symbol type, its FSE decoding
1503 table is generated from a predefined distribution table defined here.
1504 For details on how to convert this distribution into a decoding
1505 table, see Section 4.1.
1506
1507
1508
1509
1510
1511
1512
1513
1514Collet & Kucherawy Informational [Page 27]
1515
1516RFC 8478 application/zstd October 2018
1517
1518
15193.1.1.3.2.2.1. Literals Length
1520
1521 The decoding table uses an accuracy log of 6 bits (64 states).
1522
1523 short literalsLength_defaultDistribution[36] =
1524 { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
1525 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
1526 -1,-1,-1,-1
1527 };
1528
15293.1.1.3.2.2.2. Match Length
1530
1531 The decoding table uses an accuracy log of 6 bits (64 states).
1532
1533 short matchLengths_defaultDistribution[53] =
1534 { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
1535 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1536 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
1537 -1,-1,-1,-1,-1
1538 };
1539
15403.1.1.3.2.2.3. Offset Codes
1541
1542 The decoding table uses an accuracy log of 5 bits (32 states), and
1543 supports a maximum N value of 28, allowing offset values up to
1544 536,870,908.
1545
1546 If any sequence in the compressed block requires a larger offset than
1547 this, it's not possible to use the default distribution to represent
1548 it.
1549
1550 short offsetCodes_defaultDistribution[29] =
1551 { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
1552 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1
1553 };
1554
15553.1.1.4. Sequence Execution
1556
1557 Once literals and sequences have been decoded, they are combined to
1558 produce the decoded content of a block.
1559
1560 Each sequence consists of a tuple of (literals_length, offset_value,
1561 match_length), decoded as described in the
1562 Sequences_Section (Section 3.1.1.3.2). To execute a sequence, first
1563 copy literals_length bytes from the decoded literals to the output.
1564
1565
1566
1567
1568
1569
1570Collet & Kucherawy Informational [Page 28]
1571
1572RFC 8478 application/zstd October 2018
1573
1574
1575 Then, match_length bytes are copied from previous decoded data. The
1576 offset to copy from is determined by offset_value:
1577
1578 o if Offset_Value > 3, then the offset is Offset_Value - 3;
1579
1580 o if Offset_Value is from 1-3, the offset is a special repeat offset
1581 value. See Section 3.1.1.5 for how the offset is determined in
1582 this case.
1583
1584 The offset is defined as from the current position (after copying the
1585 literals), so an offset of 6 and a match length of 3 means that 3
1586 bytes should be copied from 6 bytes back. Note that all offsets
1587 leading to previously decoded data must be smaller than Window_Size
1588 defined in Frame_Header_Descriptor (Section 3.1.1.1.1).
1589
15903.1.1.5. Repeat Offsets
1591
1592 As seen above, the first three values define a repeated offset; we
1593 will call them Repeated_Offset1, Repeated_Offset2, and
1594 Repeated_Offset3. They are sorted in recency order, with
1595 Repeated_Offset1 meaning "most recent one".
1596
1597 If offset_value is 1, then the offset used is Repeated_Offset1, etc.
1598
1599 There is one exception: When the current sequence's literals_length
1600 is 0, repeated offsets are shifted by 1, so an offset_value of 1
1601 means Repeated_Offset2, an offset_value of 2 means Repeated_Offset3,
1602 and an offset_value of 3 means Repeated_Offset1 - 1_byte.
1603
1604 For the first block, the starting offset history is populated with
1605 the following values: Repeated_Offset1 (1), Repeated_Offset2 (4), and
1606 Repeated_Offset3 (8), unless a dictionary is used, in which case they
1607 come from the dictionary.
1608
1609 Then each block gets its starting offset history from the ending
1610 values of the most recent Compressed_Block. Note that blocks that
1611 are not Compressed_Block are skipped; they do not contribute to
1612 offset history.
1613
1614 The newest offset takes the lead in offset history, shifting others
1615 back (up to its previous place if it was already present). This
1616 means that when Repeated_Offset1 (most recent) is used, history is
1617 unmodified. When Repeated_Offset2 is used, it is swapped with
1618 Repeated_Offset1. If any other offset is used, it becomes
1619 Repeated_Offset1, and the rest are shifted back by 1.
1620
1621
1622
1623
1624
1625
1626Collet & Kucherawy Informational [Page 29]
1627
1628RFC 8478 application/zstd October 2018
1629
1630
16313.1.2. Skippable Frames
1632
1633 +--------------+------------+-----------+
1634 | Magic_Number | Frame_Size | User_Data |
1635 +--------------+------------+-----------+
1636 | 4 bytes | 4 bytes | n bytes |
1637 +--------------+------------+-----------+
1638
1639 Skippable frames allow the insertion of user-defined metadata into a
1640 flow of concatenated frames.
1641
1642 Skippable frames defined in this specification are compatible with
1643 skippable frames in [LZ4].
1644
1645 From a compliant decoder perspective, skippable frames simply need to
1646 be skipped, and their content ignored, resuming decoding after the
1647 skippable frame.
1648
1649 It should be noted that a skippable frame can be used to watermark a
1650 stream of concatenated frames embedding any kind of tracking
1651 information (even just a Universally Unique Identifier (UUID)).
1652 Users wary of such possibility should scan the stream of concatenated
1653 frames in an attempt to detect such frames for analysis or removal.
1654
1655 The fields are:
1656
1657 Magic_Number: 4 bytes, little-endian format. Value: 0x184D2A5?,
1658 which means any value from 0x184D2A50 to 0x184D2A5F. All 16
1659 values are valid to identify a skippable frame. This
1660 specification does not detail any specific tagging methods for
1661 skippable frames.
1662
1663 Frame_Size: This is the size, in bytes, of the following User_Data
1664 (without including the magic number nor the size field itself).
1665 This field is represented using 4 bytes, little-endian format,
1666 unsigned 32 bits. This means User_Data can't be bigger than
1667 (2^32-1) bytes.
1668
1669 User_Data: This field can be anything. Data will just be skipped by
1670 the decoder.
1671
16724. Entropy Encoding
1673
1674 Two types of entropy encoding are used by the Zstandard format: FSE
1675 and Huffman coding. Huffman is used to compress literals, while FSE
1676 is used for all other symbols (Literals_Length_Code,
1677 Match_Length_Code, and offset codes) and to compress Huffman headers.
1678
1679
1680
1681
1682Collet & Kucherawy Informational [Page 30]
1683
1684RFC 8478 application/zstd October 2018
1685
1686
16874.1. FSE
1688
1689 FSE, short for Finite State Entropy, is an entropy codec based on
1690 [ANS]. FSE encoding/decoding involves a state that is carried over
1691 between symbols, so decoding must be done in the opposite direction
1692 as encoding. Therefore, all FSE bitstreams are read from end to
1693 beginning. Note that the order of the bits in the stream is not
1694 reversed; they are simply read in the reverse order from which they
1695 were written.
1696
1697 For additional details on FSE, see Finite State Entropy [FSE].
1698
1699 FSE decoding involves a decoding table that has a power of 2 size and
1700 contains three elements: Symbol, Num_Bits, and Baseline. The base 2
1701 logarithm of the table size is its Accuracy_Log. An FSE state value
1702 represents an index in this table.
1703
1704 To obtain the initial state value, consume Accuracy_Log bits from the
1705 stream as a little-endian value. The next symbol in the stream is
1706 the Symbol indicated in the table for that state. To obtain the next
1707 state value, the decoder should consume Num_Bits bits from the stream
1708 as a little-endian value and add it to Baseline.
1709
17104.1.1. FSE Table Description
1711
1712 To decode FSE streams, it is necessary to construct the decoding
1713 table. The Zstandard format encodes FSE table descriptions as
1714 described here.
1715
1716 An FSE distribution table describes the probabilities of all symbols
1717 from 0 to the last present one (included) on a normalized scale of
1718 (1 << Accuracy_Log). Note that there must be two or more symbols
1719 with non-zero probability.
1720
1721 A bitstream is read forward, in little-endian fashion. It is not
1722 necessary to know its exact size, since the size will be discovered
1723 and reported by the decoding process. The bitstream starts by
1724 reporting on which scale it operates. If low4bits designates the
1725 lowest 4 bits of the first byte, then Accuracy_Log = low4bits + 5.
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738Collet & Kucherawy Informational [Page 31]
1739
1740RFC 8478 application/zstd October 2018
1741
1742
1743 This is followed by each symbol value, from 0 to the last present
1744 one. The number of bits used by each field is variable and depends
1745 on:
1746
1747 Remaining probabilities + 1: For example, presuming an Accuracy_Log
1748 of 8, and presuming 100 probabilities points have already been
1749 distributed, the decoder may read any value from 0 to
1750 (256 - 100 + 1) == 157, inclusive. Therefore, it must read
1751 log2sup(157) == 8 bits.
1752
1753 Value decoded: Small values use 1 fewer bit. For example, presuming
1754 values from 0 to 157 (inclusive) are possible, 255 - 157 = 98
1755 values are remaining in an 8-bit field. The first 98 values
1756 (hence from 0 to 97) use only 7 bits, and values from 98 to 157
1757 use 8 bits. This is achieved through this scheme:
1758
1759 +------------+---------------+-----------+
1760 | Value Read | Value Decoded | Bits Used |
1761 +------------+---------------+-----------+
1762 | 0 - 97 | 0 - 97 | 7 |
1763 +------------+---------------+-----------+
1764 | 98 - 127 | 98 - 127 | 8 |
1765 +------------+---------------+-----------+
1766 | 128 - 225 | 0 - 97 | 7 |
1767 +------------+---------------+-----------+
1768 | 226 - 255 | 128 - 157 | 8 |
1769 +------------+---------------+-----------+
1770
1771 Symbol probabilities are read one by one, in order. The probability
1772 is obtained from Value decoded using the formula P = Value - 1. This
1773 means the value 0 becomes the negative probability -1. This is a
1774 special probability that means "less than 1". Its effect on the
1775 distribution table is described below. For the purpose of
1776 calculating total allocated probability points, it counts as 1.
1777
1778 When a symbol has a probability of zero, it is followed by a 2-bit
1779 repeat flag. This repeat flag tells how many probabilities of zeroes
1780 follow the current one. It provides a number ranging from 0 to 3.
1781 If it is a 3, another 2-bit repeat flag follows, and so on.
1782
1783 When the last symbol reaches a cumulated total of
1784 (1 << Accuracy_Log), decoding is complete. If the last symbol makes
1785 the cumulated total go above (1 << Accuracy_Log), distribution is
1786 considered corrupted.
1787
1788
1789
1790
1791
1792
1793
1794Collet & Kucherawy Informational [Page 32]
1795
1796RFC 8478 application/zstd October 2018
1797
1798
1799 Finally, the decoder can tell how many bytes were used in this
1800 process and how many symbols are present. The bitstream consumes a
1801 round number of bytes. Any remaining bit within the last byte is
1802 simply unused.
1803
1804 The distribution of normalized probabilities is enough to create a
1805 unique decoding table. The table has a size of (1 << Accuracy_Log).
1806 Each cell describes the symbol decoded and instructions to get the
1807 next state.
1808
1809 Symbols are scanned in their natural order for "less than 1"
1810 probabilities as described above. Symbols with this probability are
1811 being attributed a single cell, starting from the end of the table
1812 and retreating. These symbols define a full state reset, reading
1813 Accuracy_Log bits.
1814
1815 All remaining symbols are allocated in their natural order. Starting
1816 from symbol 0 and table position 0, each symbol gets allocated as
1817 many cells as its probability. Cell allocation is spread, not
1818 linear; each successor position follows this rule:
1819
1820 position += (tableSize >> 1) + (tableSize >> 3) + 3;
1821 position &= tableSize - 1;
1822
1823 A position is skipped if it is already occupied by a "less than 1"
1824 probability symbol. Position does not reset between symbols; it
1825 simply iterates through each position in the table, switching to the
1826 next symbol when enough states have been allocated to the current
1827 one.
1828
1829 The result is a list of state values. Each state will decode the
1830 current symbol.
1831
1832 To get the Number_of_Bits and Baseline required for the next state,
1833 it is first necessary to sort all states in their natural order. The
1834 lower states will need 1 more bit than higher ones. The process is
1835 repeated for each symbol.
1836
1837 For example, presuming a symbol has a probability of 5, it receives
1838 five state values. States are sorted in natural order. The next
1839 power of 2 is 8. The space of probabilities is divided into 8 equal
1840 parts. Presuming the Accuracy_Log is 7, this defines 128 states, and
1841 each share (divided by 8) is 16 in size. In order to reach 8, 8 - 5
1842 = 3 lowest states will count "double", doubling the number of shares
1843 (32 in width), requiring 1 more bit in the process.
1844
1845
1846
1847
1848
1849
1850Collet & Kucherawy Informational [Page 33]
1851
1852RFC 8478 application/zstd October 2018
1853
1854
1855 Baseline is assigned starting from the higher states using fewer
1856 bits, and proceeding naturally, then resuming at the first state,
1857 each taking its allocated width from Baseline.
1858
1859 +----------------+-------+-------+--------+------+-------+
1860 | state order | 0 | 1 | 2 | 3 | 4 |
1861 +----------------+-------+-------+--------+------+-------+
1862 | width | 32 | 32 | 32 | 16 | 16 |
1863 +----------------+-------+-------+--------+------+-------+
1864 | Number_of_Bits | 5 | 5 | 5 | 4 | 4 |
1865 +----------------+-------+-------+--------+------+-------+
1866 | range number | 2 | 4 | 6 | 0 | 1 |
1867 +----------------+-------+-------+--------+------+-------+
1868 | Baseline | 32 | 64 | 96 | 0 | 16 |
1869 +----------------+-------+-------+--------+------+-------+
1870 | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
1871 +----------------+-------+-------+--------+------+-------+
1872
1873 The next state is determined from the current state by reading the
1874 required Number_of_Bits and adding the specified Baseline.
1875
1876 See Appendix A for the results of this process that are applied to
1877 the default distributions.
1878
18794.2. Huffman Coding
1880
1881 Zstandard Huffman-coded streams are read backwards, similar to the
1882 FSE bitstreams. Therefore, to find the start of the bitstream, it is
1883 necessary to know the offset of the last byte of the Huffman-coded
1884 stream.
1885
1886 After writing the last bit containing information, the compressor
1887 writes a single 1 bit and then fills the byte with 0-7 0 bits of
1888 padding. The last byte of the compressed bitstream cannot be 0 for
1889 that reason.
1890
1891 When decompressing, the last byte containing the padding is the first
1892 byte to read. The decompressor needs to skip 0-7 initial 0 bits and
1893 the first 1 bit that occurs. Afterwards, the useful part of the
1894 bitstream begins.
1895
1896 The bitstream contains Huffman-coded symbols in little-endian order,
1897 with the codes defined by the method below.
1898
1899
1900
1901
1902
1903
1904
1905
1906Collet & Kucherawy Informational [Page 34]
1907
1908RFC 8478 application/zstd October 2018
1909
1910
19114.2.1. Huffman Tree Description
1912
1913 Prefix coding represents symbols from an a priori known alphabet by
1914 bit sequences (codewords), one codeword for each symbol, in a manner
1915 such that different symbols may be represented by bit sequences of
1916 different lengths, but a parser can always parse an encoded string
1917 unambiguously symbol by symbol.
1918
1919 Given an alphabet with known symbol frequencies, the Huffman
1920 algorithm allows the construction of an optimal prefix code using the
1921 fewest bits of any possible prefix codes for that alphabet.
1922
1923 The prefix code must not exceed a maximum code length. More bits
1924 improve accuracy but yield a larger header size and require more
1925 memory or more complex decoding operations. This specification
1926 limits the maximum code length to 11 bits.
1927
1928 All literal values from zero (included) to the last present one
1929 (excluded) are represented by Weight with values from 0 to
1930 Max_Number_of_Bits. Transformation from Weight to Number_of_Bits
1931 follows this pseudocode:
1932
1933 if Weight == 0
1934 Number_of_Bits = 0
1935 else
1936 Number_of_Bits = Max_Number_of_Bits + 1 - Weight
1937
1938 The last symbol's Weight is deduced from previously decoded ones, by
1939 completing to the nearest power of 2. This power of 2 gives
1940 Max_Number_of_Bits the depth of the current tree.
1941
1942 For example, presume the following Huffman tree must be described:
1943
1944 +---------------+----------------+
1945 | Literal Value | Number_of_Bits |
1946 +---------------+----------------+
1947 | 0 | 1 |
1948 +---------------+----------------+
1949 | 1 | 2 |
1950 +---------------+----------------+
1951 | 2 | 3 |
1952 +---------------+----------------+
1953 | 3 | 0 |
1954 +---------------+----------------+
1955 | 4 | 4 |
1956 +---------------+----------------+
1957 | 5 | 4 |
1958 +---------------+----------------+
1959
1960
1961
1962Collet & Kucherawy Informational [Page 35]
1963
1964RFC 8478 application/zstd October 2018
1965
1966
1967 The tree depth is 4, since its longest element uses 4 bits. (The
1968 longest elements are those with the smallest frequencies.) Value 5
1969 will not be listed as it can be determined from the values for 0-4,
1970 nor will values above 5 as they are all 0. Values from 0 to 4 will
1971 be listed using Weight instead of Number_of_Bits. The pseudocode to
1972 determine Weight is:
1973
1974 if Number_of_Bits == 0
1975 Weight = 0
1976 else
1977 Weight = Max_Number_of_Bits + 1 - Number_of_Bits
1978
1979 It gives the following series of weights:
1980
1981 +---------------+--------+
1982 | Literal Value | Weight |
1983 +---------------+--------+
1984 | 0 | 4 |
1985 +---------------+--------+
1986 | 1 | 3 |
1987 +---------------+--------+
1988 | 2 | 2 |
1989 +---------------+--------+
1990 | 3 | 0 |
1991 +---------------+--------+
1992 | 4 | 1 |
1993 +---------------+--------+
1994
1995 The decoder will do the inverse operation: having collected weights
1996 of literals from 0 to 4, it knows the last literal, 5, is present
1997 with a non-zero Weight. The Weight of 5 can be determined by
1998 advancing to the next power of 2. The sum of 2^(Weight-1) (excluding
1999 0's) is 15. The nearest power of 2 is 16. Therefore,
2000 Max_Number_of_Bits = 4 and Weight[5] = 16 - 15 = 1.
2001
20024.2.1.1. Huffman Tree Header
2003
2004 This is a single byte value (0-255), which describes how the series
2005 of weights is encoded.
2006
2007 headerByte < 128: The series of weights is compressed using FSE (see
2008 below). The length of the FSE-compressed series is equal to
2009 headerByte (0-127).
2010
2011
2012
2013
2014
2015
2016
2017
2018Collet & Kucherawy Informational [Page 36]
2019
2020RFC 8478 application/zstd October 2018
2021
2022
2023 headerByte >= 128: This is a direct representation, where each
2024 Weight is written directly as a 4-bit field (0-15). They are
2025 encoded forward, 2 weights to a byte with the first weight taking
2026 the top 4 bits and the second taking the bottom 4; for example,
2027 the following operations could be used to read the weights:
2028
2029 Weight[0] = (Byte[0] >> 4)
2030 Weight[1] = (Byte[0] & 0xf),
2031 etc.
2032
2033 The full representation occupies ceiling(Number_of_Symbols/2)
2034 bytes, meaning it uses only full bytes even if Number_of_Symbols
2035 is odd. Number_of_Symbols = headerByte - 127. Note that maximum
2036 Number_of_Symbols is 255 - 127 = 128. If any literal has a value
2037 over 128, raw header mode is not possible, and it is necessary to
2038 use FSE compression.
2039
20404.2.1.2. FSE Compression of Huffman Weights
2041
2042 In this case, the series of Huffman weights is compressed using FSE
2043 compression. It is a single bitstream with two interleaved states,
2044 sharing a single distribution table.
2045
2046 To decode an FSE bitstream, it is necessary to know its compressed
2047 size. Compressed size is provided by headerByte. It's also
2048 necessary to know its maximum possible decompressed size, which is
2049 255, since literal values span from 0 to 255, and the last symbol's
2050 Weight is not represented.
2051
2052 An FSE bitstream starts by a header, describing probabilities
2053 distribution. It will create a decoding table. For a list of
2054 Huffman weights, the maximum accuracy log is 6 bits. For more
2055 details, see Section 4.1.1.
2056
2057 The Huffman header compression uses two states, which share the same
2058 FSE distribution table. The first state (State1) encodes the even-
2059 numbered index symbols, and the second (State2) encodes the odd-
2060 numbered index symbols. State1 is initialized first, and then
2061 State2, and they take turns decoding a single symbol and updating
2062 their state. For more details on these FSE operations, see
2063 Section 4.1.
2064
2065 The number of symbols to be decoded is determined by tracking the
2066 bitStream overflow condition: If updating state after decoding a
2067 symbol would require more bits than remain in the stream, it is
2068 assumed that extra bits are zero. Then, symbols for each of the
2069 final states are decoded and the process is complete.
2070
2071
2072
2073
2074Collet & Kucherawy Informational [Page 37]
2075
2076RFC 8478 application/zstd October 2018
2077
2078
20794.2.1.3. Conversion from Weights to Huffman Prefix Codes
2080
2081 All present symbols will now have a Weight value. It is possible to
2082 transform weights into Number_of_Bits, using this formula:
2083
2084 if Weight > 0
2085 Number_of_Bits = Max_Number_of_Bits + 1 - Weight
2086 else
2087 Number_of_Bits = 0
2088
2089 Symbols are sorted by Weight. Within the same Weight, symbols keep
2090 natural sequential order. Symbols with a Weight of zero are removed.
2091 Then, starting from the lowest Weight, prefix codes are distributed
2092 in sequential order.
2093
2094 For example, assume the following list of weights has been decoded:
2095
2096 +---------+--------+
2097 | Literal | Weight |
2098 +---------+--------+
2099 | 0 | 4 |
2100 +---------+--------+
2101 | 1 | 3 |
2102 +---------+--------+
2103 | 2 | 2 |
2104 +---------+--------+
2105 | 3 | 0 |
2106 +---------+--------+
2107 | 4 | 1 |
2108 +---------+--------+
2109 | 5 | 1 |
2110 +---------+--------+
2111
2112
2113
2114
2115
2116
2117
2118
2119
2120
2121
2122
2123
2124
2125
2126
2127
2128
2129
2130Collet & Kucherawy Informational [Page 38]
2131
2132RFC 8478 application/zstd October 2018
2133
2134
2135 Sorting by weight and then the natural sequential order yields the
2136 following distribution:
2137
2138 +---------+--------+----------------+--------------+
2139 | Literal | Weight | Number_Of_Bits | Prefix Codes |
2140 +---------+--------+----------------|--------------+
2141 | 3 | 0 | 0 | N/A |
2142 +---------+--------+----------------|--------------+
2143 | 4 | 1 | 4 | 0000 |
2144 +---------+--------+----------------|--------------+
2145 | 5 | 1 | 4 | 0001 |
2146 +---------+--------+----------------|--------------+
2147 | 2 | 2 | 3 | 001 |
2148 +---------+--------+----------------|--------------+
2149 | 1 | 3 | 2 | 01 |
2150 +---------+--------+----------------|--------------+
2151 | 0 | 4 | 1 | 1 |
2152 +---------+--------+----------------|--------------+
2153
21544.2.2. Huffman-Coded Streams
2155
2156 Given a Huffman decoding table, it is possible to decode a Huffman-
2157 coded stream.
2158
2159 Each bitstream must be read backward, which starts from the end and
2160 goes up to the beginning. Therefore, it is necessary to know the
2161 size of each bitstream.
2162
2163 It is also necessary to know exactly which bit is the last. This is
2164 detected by a final bit flag: the highest bit of the last byte is a
2165 final-bit-flag. Consequently, a last byte of 0 is not possible. And
2166 the final-bit-flag itself is not part of the useful bitstream.
2167 Hence, the last byte contains between 0 and 7 useful bits.
2168
2169 Starting from the end, it is possible to read the bitstream in a
2170 little-endian fashion, keeping track of already used bits. Since the
2171 bitstream is encoded in reverse order, starting from the end, read
2172 symbols in forward order.
2173
2174
2175
2176
2177
2178
2179
2180
2181
2182
2183
2184
2185
2186Collet & Kucherawy Informational [Page 39]
2187
2188RFC 8478 application/zstd October 2018
2189
2190
2191 For example, if the literal sequence "0145" was encoded using the
2192 above prefix code, it would be encoded (in reverse order) as:
2193
2194 +---------+----------+
2195 | Symbol | Encoding |
2196 +---------+----------+
2197 | 5 | 0000 |
2198 +---------+----------+
2199 | 4 | 0001 |
2200 +---------+----------+
2201 | 1 | 01 |
2202 +---------+----------+
2203 | 0 | 1 |
2204 +---------+----------+
2205 | Padding | 00001 |
2206 +---------+----------+
2207
2208 This results in the following 2-byte bitstream:
2209
2210 00010000 00001101
2211
2212 Here is an alternative representation with the symbol codes separated
2213 by underscores:
2214
2215 0001_0000 00001_1_01
2216
2217 Reading the highest Max_Number_of_Bits bits, it's possible to compare
2218 the extracted value to the decoding table, determining the symbol to
2219 decode and number of bits to discard.
2220
2221 The process continues reading up to the required number of symbols
2222 per stream. If a bitstream is not entirely and exactly consumed,
2223 hence reaching exactly its beginning position with all bits consumed,
2224 the decoding process is considered faulty.
2225
22265. Dictionary Format
2227
2228 Zstandard is compatible with "raw content" dictionaries, free of any
2229 format restriction, except that they must be at least 8 bytes. These
2230 dictionaries function as if they were just the content part of a
2231 formatted dictionary.
2232
2233 However, dictionaries created by "zstd --train" in the reference
2234 implementation follow a specific format, described here.
2235
2236 Dictionaries are not included in the compressed content but rather
2237 are provided out of band. That is, the Dictionary_ID identifies
2238 which should be used, but this specification does not describe the
2239
2240
2241
2242Collet & Kucherawy Informational [Page 40]
2243
2244RFC 8478 application/zstd October 2018
2245
2246
2247 mechanism by which the dictionary is obtained prior to use during
2248 compression or decompression.
2249
2250 A dictionary has a size, defined either by a buffer limit or a file
2251 size. The general format is:
2252
2253 +--------------+---------------+----------------+---------+
2254 | Magic_Number | Dictionary_ID | Entropy_Tables | Content |
2255 +--------------+---------------+----------------+---------+
2256
2257 Magic_Number: 4 bytes ID, value 0xEC30A437, little-endian format.
2258
2259 Dictionary_ID: 4 bytes, stored in little-endian format.
2260 Dictionary_ID can be any value, except 0 (which means no
2261 Dictionary_ID). It is used by decoders to check if they use the
2262 correct dictionary. If the frame is going to be distributed in a
2263 private environment, any Dictionary_ID can be used. However, for
2264 public distribution of compressed frames, the following ranges are
2265 reserved and shall not be used:
2266
2267 low range: <= 32767
2268 high range: >= (2^31)
2269
2270 Entropy_Tables: Follow the same format as the tables in compressed
2271 blocks. See the relevant FSE and Huffman sections for how to
2272 decode these tables. They are stored in the following order:
2273 Huffman table for literals, FSE table for offsets, FSE table for
2274 match lengths, and FSE table for literals lengths. These tables
2275 populate the Repeat Stats literals mode and Repeat distribution
2276 mode for sequence decoding. It is finally followed by 3 offset
2277 values, populating repeat offsets (instead of using {1,4,8}),
2278 stored in order, 4-bytes little-endian each, for a total of 12
2279 bytes. Each repeat offset must have a value less than the
2280 dictionary size.
2281
2282 Content: The rest of the dictionary is its content. The content
2283 acts as a "past" in front of data to be compressed or
2284 decompressed, so it can be referenced in sequence commands. As
2285 long as the amount of data decoded from this frame is less than or
2286 equal to Window_Size, sequence commands may specify offsets longer
2287 than the total length of decoded output so far to reference back
2288 to the dictionary, even parts of the dictionary with offsets
2289 larger than Window_Size. After the total output has surpassed
2290 Window_Size, however, this is no longer allowed, and the
2291 dictionary is no longer accessible.
2292
2293
2294
2295
2296
2297
2298Collet & Kucherawy Informational [Page 41]
2299
2300RFC 8478 application/zstd October 2018
2301
2302
23036. IANA Considerations
2304
2305 IANA has made two registrations, as described below.
2306
23076.1. The 'application/zstd' Media Type
2308
2309 The 'application/zstd' media type identifies a block of data that is
2310 compressed using zstd compression. The data is a stream of bytes as
2311 described in this document. IANA has added the following to the
2312 "Media Types" registry:
2313
2314 Type name: application
2315
2316 Subtype name: zstd
2317
2318 Required parameters: N/A
2319
2320 Optional parameters: N/A
2321
2322 Encoding considerations: binary
2323
2324 Security considerations: See Section 7 of RFC 8478
2325
2326 Interoperability considerations: N/A
2327
2328 Published specification: RFC 8478
2329
2330 Applications that use this media type: anywhere data size is an
2331 issue
2332
2333 Additional information:
2334
2335 Magic number(s): 4 bytes, little-endian format.
2336 Value: 0xFD2FB528
2337
2338 File extension(s): zst
2339
2340 Macintosh file type code(s): N/A
2341
2342 For further information: See [ZSTD]
2343
2344 Intended usage: common
2345
2346 Restrictions on usage: N/A
2347
2348 Author: Murray S. Kucherawy
2349
2350 Change Controller: IETF
2351
2352
2353
2354Collet & Kucherawy Informational [Page 42]
2355
2356RFC 8478 application/zstd October 2018
2357
2358
2359 Provisional registration: no
2360
23616.2. Content Encoding
2362
2363 IANA has added the following entry to the "HTTP Content Coding
2364 Registry" within the "Hypertext Transfer Protocol (HTTP) Parameters"
2365 registry:
2366
2367 Name: zstd
2368
2369 Description: A stream of bytes compressed using the Zstandard
2370 protocol
2371
2372 Pointer to specification text: RFC 8478
2373
23746.3. Dictionaries
2375
2376 Work in progress includes development of dictionaries that will
2377 optimize compression and decompression of particular types of data.
2378 Specification of such dictionaries for public use will necessitate
2379 registration of a code point from the reserved range described in
2380 Section 3.1.1.1.3 and its association with a specific dictionary.
2381
2382 However, there are at present no such dictionaries published for
2383 public use, so this document makes no immediate request of IANA to
2384 create such a registry.
2385
23867. Security Considerations
2387
2388 Any data compression method involves the reduction of redundancy in
2389 the data. Zstandard is no exception, and the usual precautions
2390 apply.
2391
2392 One should never compress a message whose content must remain secret
2393 with a message generated by a third party. Such a compression can be
2394 used to guess the content of the secret message through analysis of
2395 entropy reduction. This was demonstrated in the Compression Ratio
2396 Info-leak Made Easy (CRIME) attack [CRIME], for example.
2397
2398 A decoder has to demonstrate capabilities to detect and prevent any
2399 kind of data tampering in the compressed frame from triggering system
2400 faults, such as reading or writing beyond allowed memory ranges.
2401 This can be guaranteed by either the implementation language or
2402 careful bound checkings. Of particular note is the encoding of
2403 Number_of_Sequences values that cause the decoder to read into the
2404 block header (and beyond), as well as the indication of a
2405 Frame_Content_Size that is smaller than the actual decompressed data,
2406 in an attempt to trigger a buffer overflow. It is highly recommended
2407
2408
2409
2410Collet & Kucherawy Informational [Page 43]
2411
2412RFC 8478 application/zstd October 2018
2413
2414
2415 to fuzz-test (i.e., provide invalid, unexpected, or random input and
2416 verify safe operation of) decoder implementations to test and harden
2417 their capability to detect bad frames and deal with them without any
2418 adverse system side effect.
2419
2420 An attacker may provide correctly formed compressed frames with
2421 unreasonable memory requirements. A decoder must always control
2422 memory requirements and enforce some (system-specific) limits in
2423 order to protect memory usage from such scenarios.
2424
2425 Compression can be optimized by training a dictionary on a variety of
2426 related content payloads. This dictionary must then be available at
2427 the decoder for decompression of the payload to be possible. While
2428 this document does not specify how to acquire a dictionary for a
2429 given compressed payload, it is worth noting that third-party
2430 dictionaries may interact unexpectedly with a decoder, leading to
2431 possible memory or other resource exhaustion attacks. We expect such
2432 topics to be discussed in further detail in the Security
2433 Considerations section of a forthcoming RFC for dictionary
2434 acquisition and transmission, but highlight this issue now out of an
2435 abundance of caution.
2436
2437 As discussed in Section 3.1.2, it is possible to store arbitrary user
2438 metadata in skippable frames. While such frames are ignored during
2439 decompression of the data, they can be used as a watermark to track
2440 the path of the compressed payload.
2441
24428. Implementation Status
2443
2444 Source code for a C language implementation of a Zstandard-compliant
2445 library is available at [ZSTD-GITHUB]. This implementation is
2446 considered to be the reference implementation and is production
2447 ready; it implements the full range of the specification. It is
2448 routinely tested against security hazards and widely deployed within
2449 Facebook infrastructure.
2450
2451 The reference version is optimized for speed and is highly portable.
2452 It has been proven to run safely on multiple architectures (e.g.,
2453 x86, x64, ARM, MIPS, PowerPC, IA64) featuring 32- or 64-bit
2454 addressing schemes, a little- or big-endian storage scheme, a number
2455 of different operating systems (e.g., UNIX (including Linux, BSD,
2456 OS-X, and Solaris) and Windows), and a number of compilers (e.g.,
2457 gcc, clang, visual, and icc).
2458
2459
2460
2461
2462
2463
2464
2465
2466Collet & Kucherawy Informational [Page 44]
2467
2468RFC 8478 application/zstd October 2018
2469
2470
24719. References
2472
24739.1. Normative References
2474
2475 [ZSTD] "Zstandard", <http://www.zstd.net>.
2476
24779.2. Informative References
2478
2479 [ANS] Duda, J., "Asymmetric numeral systems: entropy coding
2480 combining speed of Huffman coding with compression rate of
2481 arithmetic coding", January 2014,
2482 <https://arxiv.org/pdf/1311.2540>.
2483
2484 [CRIME] "CRIME", June 2018, <https://en.wikipedia.org/w/
2485 index.php?title=CRIME&oldid=844538656>.
2486
2487 [FSE] "FiniteStateEntropy", commit 6efa78a, June 2018,
2488 <https://github.com/Cyan4973/FiniteStateEntropy/>.
2489
2490 [LZ4] "LZ4 Frame Format Description", commit d03224b, January
2491 2018, <https://github.com/lz4/lz4/blob/master/doc/
2492 lz4_Frame_format.md>.
2493
2494 [RFC1952] Deutsch, P., "GZIP file format specification version 4.3",
2495 RFC 1952, DOI 10.17487/RFC1952, May 1996,
2496 <https://www.rfc-editor.org/info/rfc1952>.
2497
2498 [XXHASH] "XXHASH Algorithm", <http://www.xxhash.org>.
2499
2500 [ZSTD-GITHUB]
2501 "zstd", commit 8514bd8, August 2018,
2502 <https://github.com/facebook/zstd>.
2503
2504
2505
2506
2507
2508
2509
2510
2511
2512
2513
2514
2515
2516
2517
2518
2519
2520
2521
2522Collet & Kucherawy Informational [Page 45]
2523
2524RFC 8478 application/zstd October 2018
2525
2526
2527Appendix A. Decoding Tables for Predefined Codes
2528
2529 This appendix contains FSE decoding tables for the predefined literal
2530 length, match length, and offset codes. The tables have been
2531 constructed using the algorithm as given above in Section 4.1.1. The
2532 tables here can be used as examples to crosscheck that an
2533 implementation has built its decoding tables correctly.
2534
2535A.1. Literal Length Code Table
2536
2537 +-------+--------+----------------+------+
2538 | State | Symbol | Number_Of_Bits | Base |
2539 +-------+--------+----------------+------+
2540 | 0 | 0 | 0 | 0 |
2541 +-------+--------+----------------+------+
2542 | 0 | 0 | 4 | 0 |
2543 +-------+--------+----------------+------+
2544 | 1 | 0 | 4 | 16 |
2545 +-------+--------+----------------+------+
2546 | 2 | 1 | 5 | 32 |
2547 +-------+--------+----------------+------+
2548 | 3 | 3 | 5 | 0 |
2549 +-------+--------+----------------+------+
2550 | 4 | 4 | 5 | 0 |
2551 +-------+--------+----------------+------+
2552 | 5 | 6 | 5 | 0 |
2553 +-------+--------+----------------+------+
2554 | 6 | 7 | 5 | 0 |
2555 +-------+--------+----------------+------+
2556 | 7 | 9 | 5 | 0 |
2557 +-------+--------+----------------+------+
2558 | 8 | 10 | 5 | 0 |
2559 +-------+--------+----------------+------+
2560 | 9 | 12 | 5 | 0 |
2561 +-------+--------+----------------+------+
2562 | 10 | 14 | 6 | 0 |
2563 +-------+--------+----------------+------+
2564 | 11 | 16 | 5 | 0 |
2565 +-------+--------+----------------+------+
2566 | 12 | 18 | 5 | 0 |
2567 +-------+--------+----------------+------+
2568 | 13 | 19 | 5 | 0 |
2569 +-------+--------+----------------+------+
2570 | 14 | 21 | 5 | 0 |
2571 +-------+--------+----------------+------+
2572 | 15 | 22 | 5 | 0 |
2573 +-------+--------+----------------+------+
2574 | 16 | 24 | 5 | 0 |
2575
2576
2577
2578Collet & Kucherawy Informational [Page 46]
2579
2580RFC 8478 application/zstd October 2018
2581
2582
2583 +-------+--------+----------------+------+
2584 | 17 | 25 | 5 | 32 |
2585 +-------+--------+----------------+------+
2586 | 18 | 26 | 5 | 0 |
2587 +-------+--------+----------------+------+
2588 | 19 | 27 | 6 | 0 |
2589 +-------+--------+----------------+------+
2590 | 20 | 29 | 6 | 0 |
2591 +-------+--------+----------------+------+
2592 | 21 | 31 | 6 | 0 |
2593 +-------+--------+----------------+------+
2594 | 22 | 0 | 4 | 32 |
2595 +-------+--------+----------------+------+
2596 | 23 | 1 | 4 | 0 |
2597 +-------+--------+----------------+------+
2598 | 24 | 2 | 5 | 0 |
2599 +-------+--------+----------------+------+
2600 | 25 | 4 | 5 | 32 |
2601 +-------+--------+----------------+------+
2602 | 26 | 5 | 5 | 0 |
2603 +-------+--------+----------------+------+
2604 | 27 | 7 | 5 | 32 |
2605 +-------+--------+----------------+------+
2606 | 28 | 8 | 5 | 0 |
2607 +-------+--------+----------------+------+
2608 | 29 | 10 | 5 | 32 |
2609 +-------+--------+----------------+------+
2610 | 30 | 11 | 5 | 0 |
2611 +-------+--------+----------------+------+
2612 | 31 | 13 | 6 | 0 |
2613 +-------+--------+----------------+------+
2614 | 32 | 16 | 5 | 32 |
2615 +-------+--------+----------------+------+
2616 | 33 | 17 | 5 | 0 |
2617 +-------+--------+----------------+------+
2618 | 34 | 19 | 5 | 32 |
2619 +-------+--------+----------------+------+
2620 | 35 | 20 | 5 | 0 |
2621 +-------+--------+----------------+------+
2622 | 36 | 22 | 5 | 32 |
2623 +-------+--------+----------------+------+
2624 | 37 | 23 | 5 | 0 |
2625 +-------+--------+----------------+------+
2626 | 38 | 25 | 4 | 0 |
2627 +-------+--------+----------------+------+
2628 | 39 | 25 | 4 | 16 |
2629 +-------+--------+----------------+------+
2630 | 40 | 26 | 5 | 32 |
2631
2632
2633
2634Collet & Kucherawy Informational [Page 47]
2635
2636RFC 8478 application/zstd October 2018
2637
2638
2639 +-------+--------+----------------+------+
2640 | 41 | 28 | 6 | 0 |
2641 +-------+--------+----------------+------+
2642 | 42 | 30 | 6 | 0 |
2643 +-------+--------+----------------+------+
2644 | 43 | 0 | 4 | 48 |
2645 +-------+--------+----------------+------+
2646 | 44 | 1 | 4 | 16 |
2647 +-------+--------+----------------+------+
2648 | 45 | 2 | 5 | 32 |
2649 +-------+--------+----------------+------+
2650 | 46 | 3 | 5 | 32 |
2651 +-------+--------+----------------+------+
2652 | 47 | 5 | 5 | 32 |
2653 +-------+--------+----------------+------+
2654 | 48 | 6 | 5 | 32 |
2655 +-------+--------+----------------+------+
2656 | 49 | 8 | 5 | 32 |
2657 +-------+--------+----------------+------+
2658 | 50 | 9 | 5 | 32 |
2659 +-------+--------+----------------+------+
2660 | 51 | 11 | 5 | 32 |
2661 +-------+--------+----------------+------+
2662 | 52 | 12 | 5 | 32 |
2663 +-------+--------+----------------+------+
2664 | 53 | 15 | 6 | 0 |
2665 +-------+--------+----------------+------+
2666 | 54 | 17 | 5 | 32 |
2667 +-------+--------+----------------+------+
2668 | 55 | 18 | 5 | 32 |
2669 +-------+--------+----------------+------+
2670 | 56 | 20 | 5 | 32 |
2671 +-------+--------+----------------+------+
2672 | 57 | 21 | 5 | 32 |
2673 +-------+--------+----------------+------+
2674 | 58 | 23 | 5 | 32 |
2675 +-------+--------+----------------+------+
2676 | 59 | 24 | 5 | 32 |
2677 +-------+--------+----------------+------+
2678 | 60 | 35 | 6 | 0 |
2679 +-------+--------+----------------+------+
2680 | 61 | 34 | 6 | 0 |
2681 +-------+--------+----------------+------+
2682 | 62 | 33 | 6 | 0 |
2683 +-------+--------+----------------+------+
2684 | 63 | 32 | 6 | 0 |
2685 +-------+--------+----------------+------+
2686
2687
2688
2689
2690Collet & Kucherawy Informational [Page 48]
2691
2692RFC 8478 application/zstd October 2018
2693
2694
2695A.2. Match Length Code Table
2696
2697 +-------+--------+----------------+------+
2698 | State | Symbol | Number_Of_Bits | Base |
2699 +-------+--------+----------------+------+
2700 | 0 | 0 | 0 | 0 |
2701 +-------+--------+----------------+------+
2702 | 0 | 0 | 6 | 0 |
2703 +-------+--------+----------------+------+
2704 | 1 | 1 | 4 | 0 |
2705 +-------+--------+----------------+------+
2706 | 2 | 2 | 5 | 32 |
2707 +-------+--------+----------------+------+
2708 | 3 | 3 | 5 | 0 |
2709 +-------+--------+----------------+------+
2710 | 4 | 5 | 5 | 0 |
2711 +-------+--------+----------------+------+
2712 | 5 | 6 | 5 | 0 |
2713 +-------+--------+----------------+------+
2714 | 6 | 8 | 5 | 0 |
2715 +-------+--------+----------------+------+
2716 | 7 | 10 | 6 | 0 |
2717 +-------+--------+----------------+------+
2718 | 8 | 13 | 6 | 0 |
2719 +-------+--------+----------------+------+
2720 | 9 | 16 | 6 | 0 |
2721 +-------+--------+----------------+------+
2722 | 10 | 19 | 6 | 0 |
2723 +-------+--------+----------------+------+
2724 | 11 | 22 | 6 | 0 |
2725 +-------+--------+----------------+------+
2726 | 12 | 25 | 6 | 0 |
2727 +-------+--------+----------------+------+
2728 | 13 | 28 | 6 | 0 |
2729 +-------+--------+----------------+------+
2730 | 14 | 31 | 6 | 0 |
2731 +-------+--------+----------------+------+
2732 | 15 | 33 | 6 | 0 |
2733 +-------+--------+----------------+------+
2734 | 16 | 35 | 6 | 0 |
2735 +-------+--------+----------------+------+
2736 | 17 | 37 | 6 | 0 |
2737 +-------+--------+----------------+------+
2738 | 18 | 39 | 6 | 0 |
2739 +-------+--------+----------------+------+
2740 | 19 | 41 | 6 | 0 |
2741 +-------+--------+----------------+------+
2742 | 20 | 43 | 6 | 0 |
2743
2744
2745
2746Collet & Kucherawy Informational [Page 49]
2747
2748RFC 8478 application/zstd October 2018
2749
2750
2751 +-------+--------+----------------+------+
2752 | 21 | 45 | 6 | 0 |
2753 +-------+--------+----------------+------+
2754 | 22 | 1 | 4 | 16 |
2755 +-------+--------+----------------+------+
2756 | 23 | 2 | 4 | 0 |
2757 +-------+--------+----------------+------+
2758 | 24 | 3 | 5 | 32 |
2759 +-------+--------+----------------+------+
2760 | 25 | 4 | 5 | 0 |
2761 +-------+--------+----------------+------+
2762 | 26 | 6 | 5 | 32 |
2763 +-------+--------+----------------+------+
2764 | 27 | 7 | 5 | 0 |
2765 +-------+--------+----------------+------+
2766 | 28 | 9 | 6 | 0 |
2767 +-------+--------+----------------+------+
2768 | 29 | 12 | 6 | 0 |
2769 +-------+--------+----------------+------+
2770 | 30 | 15 | 6 | 0 |
2771 +-------+--------+----------------+------+
2772 | 31 | 18 | 6 | 0 |
2773 +-------+--------+----------------+------+
2774 | 32 | 21 | 6 | 0 |
2775 +-------+--------+----------------+------+
2776 | 33 | 24 | 6 | 0 |
2777 +-------+--------+----------------+------+
2778 | 34 | 27 | 6 | 0 |
2779 +-------+--------+----------------+------+
2780 | 35 | 30 | 6 | 0 |
2781 +-------+--------+----------------+------+
2782 | 36 | 32 | 6 | 0 |
2783 +-------+--------+----------------+------+
2784 | 37 | 34 | 6 | 0 |
2785 +-------+--------+----------------+------+
2786 | 38 | 36 | 6 | 0 |
2787 +-------+--------+----------------+------+
2788 | 39 | 38 | 6 | 0 |
2789 +-------+--------+----------------+------+
2790 | 40 | 40 | 6 | 0 |
2791 +-------+--------+----------------+------+
2792 | 41 | 42 | 6 | 0 |
2793 +-------+--------+----------------+------+
2794 | 42 | 44 | 6 | 0 |
2795 +-------+--------+----------------+------+
2796 | 43 | 1 | 4 | 32 |
2797 +-------+--------+----------------+------+
2798 | 44 | 1 | 4 | 48 |
2799
2800
2801
2802Collet & Kucherawy Informational [Page 50]
2803
2804RFC 8478 application/zstd October 2018
2805
2806
2807 +-------+--------+----------------+------+
2808 | 45 | 2 | 4 | 16 |
2809 +-------+--------+----------------+------+
2810 | 46 | 4 | 5 | 32 |
2811 +-------+--------+----------------+------+
2812 | 47 | 5 | 5 | 32 |
2813 +-------+--------+----------------+------+
2814 | 48 | 7 | 5 | 32 |
2815 +-------+--------+----------------+------+
2816 | 49 | 8 | 5 | 32 |
2817 +-------+--------+----------------+------+
2818 | 50 | 11 | 6 | 0 |
2819 +-------+--------+----------------+------+
2820 | 51 | 14 | 6 | 0 |
2821 +-------+--------+----------------+------+
2822 | 52 | 17 | 6 | 0 |
2823 +-------+--------+----------------+------+
2824 | 53 | 20 | 6 | 0 |
2825 +-------+--------+----------------+------+
2826 | 54 | 23 | 6 | 0 |
2827 +-------+--------+----------------+------+
2828 | 55 | 26 | 6 | 0 |
2829 +-------+--------+----------------+------+
2830 | 56 | 29 | 6 | 0 |
2831 +-------+--------+----------------+------+
2832 | 57 | 52 | 6 | 0 |
2833 +-------+--------+----------------+------+
2834 | 58 | 51 | 6 | 0 |
2835 +-------+--------+----------------+------+
2836 | 59 | 50 | 6 | 0 |
2837 +-------+--------+----------------+------+
2838 | 60 | 49 | 6 | 0 |
2839 +-------+--------+----------------+------+
2840 | 61 | 48 | 6 | 0 |
2841 +-------+--------+----------------+------+
2842 | 62 | 47 | 6 | 0 |
2843 +-------+--------+----------------+------+
2844 | 63 | 46 | 6 | 0 |
2845 +-------+--------+----------------+------+
2846
2847
2848
2849
2850
2851
2852
2853
2854
2855
2856
2857
2858Collet & Kucherawy Informational [Page 51]
2859
2860RFC 8478 application/zstd October 2018
2861
2862
2863A.3. Offset Code Table
2864
2865 +-------+--------+----------------+------+
2866 | State | Symbol | Number_Of_Bits | Base |
2867 +-------+--------+----------------+------+
2868 | 0 | 0 | 0 | 0 |
2869 +-------+--------+----------------+------+
2870 | 0 | 0 | 5 | 0 |
2871 +-------+--------+----------------+------+
2872 | 1 | 6 | 4 | 0 |
2873 +-------+--------+----------------+------+
2874 | 2 | 9 | 5 | 0 |
2875 +-------+--------+----------------+------+
2876 | 3 | 15 | 5 | 0 |
2877 +-------+--------+----------------+------+
2878 | 4 | 21 | 5 | 0 |
2879 +-------+--------+----------------+------+
2880 | 5 | 3 | 5 | 0 |
2881 +-------+--------+----------------+------+
2882 | 6 | 7 | 4 | 0 |
2883 +-------+--------+----------------+------+
2884 | 7 | 12 | 5 | 0 |
2885 +-------+--------+----------------+------+
2886 | 8 | 18 | 5 | 0 |
2887 +-------+--------+----------------+------+
2888 | 9 | 23 | 5 | 0 |
2889 +-------+--------+----------------+------+
2890 | 10 | 5 | 5 | 0 |
2891 +-------+--------+----------------+------+
2892 | 11 | 8 | 4 | 0 |
2893 +-------+--------+----------------+------+
2894 | 12 | 14 | 5 | 0 |
2895 +-------+--------+----------------+------+
2896 | 13 | 20 | 5 | 0 |
2897 +-------+--------+----------------+------+
2898 | 14 | 2 | 5 | 0 |
2899 +-------+--------+----------------+------+
2900 | 15 | 7 | 4 | 16 |
2901 +-------+--------+----------------+------+
2902 | 16 | 11 | 5 | 0 |
2903 +-------+--------+----------------+------+
2904 | 17 | 17 | 5 | 0 |
2905 +-------+--------+----------------+------+
2906 | 18 | 22 | 5 | 0 |
2907 +-------+--------+----------------+------+
2908 | 19 | 4 | 5 | 0 |
2909 +-------+--------+----------------+------+
2910 | 20 | 8 | 4 | 16 |
2911
2912
2913
2914Collet & Kucherawy Informational [Page 52]
2915
2916RFC 8478 application/zstd October 2018
2917
2918
2919 +-------+--------+----------------+------+
2920 | 21 | 13 | 5 | 0 |
2921 +-------+--------+----------------+------+
2922 | 22 | 19 | 5 | 0 |
2923 +-------+--------+----------------+------+
2924 | 23 | 1 | 5 | 0 |
2925 +-------+--------+----------------+------+
2926 | 24 | 6 | 4 | 16 |
2927 +-------+--------+----------------+------+
2928 | 25 | 10 | 5 | 0 |
2929 +-------+--------+----------------+------+
2930 | 26 | 16 | 5 | 0 |
2931 +-------+--------+----------------+------+
2932 | 27 | 28 | 5 | 0 |
2933 +-------+--------+----------------+------+
2934 | 28 | 27 | 5 | 0 |
2935 +-------+--------+----------------+------+
2936 | 29 | 26 | 5 | 0 |
2937 +-------+--------+----------------+------+
2938 | 30 | 25 | 5 | 0 |
2939 +-------+--------+----------------+------+
2940 | 31 | 24 | 5 | 0 |
2941 +-------+--------+----------------+------+
2942
2943Acknowledgments
2944
2945 zstd was developed by Yann Collet.
2946
2947 Bobo Bose-Kolanu, Felix Handte, Kyle Nekritz, Nick Terrell, and David
2948 Schleimer provided helpful feedback during the development of this
2949 document.
2950
2951
2952
2953
2954
2955
2956
2957
2958
2959
2960
2961
2962
2963
2964
2965
2966
2967
2968
2969
2970Collet & Kucherawy Informational [Page 53]
2971
2972RFC 8478 application/zstd October 2018
2973
2974
2975Authors' Addresses
2976
2977 Yann Collet
2978 Facebook
2979 1 Hacker Way
2980 Menlo Park, CA 94025
2981 United States of America
2982
2983 Email: cyan@fb.com
2984
2985
2986 Murray S. Kucherawy (editor)
2987 Facebook
2988 1 Hacker Way
2989 Menlo Park, CA 94025
2990 United States of America
2991
2992 Email: msk@fb.com
2993
2994
2995
2996
2997
2998
2999
3000
3001
3002
3003
3004
3005
3006
3007
3008
3009
3010
3011
3012
3013
3014
3015
3016
3017
3018
3019
3020
3021
3022
3023
3024
3025
3026Collet & Kucherawy Informational [Page 54]
3027