avtcore WG E. Språng Internet-Draft Google Intended status: Informational 16 March 2025 Expires: 17 September 2025 RTP Header Extension for Automatic Detection of Video Corruptions draft-sprang-avtcore-corruption-detection-00 Abstract This document describes an RTP header extensions that can be use to carry select filtered samples from a video frame together with metrics relating to the expected distortions caused by lossy compression of said frame. This data allows a receiver to validate that the output from a decoder matches the frame the went into the encoder on the remote - i.e. validating that the video pipeline is in a correct state free of any video corruptions. About This Document This note is to be removed before publishing as an RFC. The latest revision of this draft can be found at https://github.com/sprangerik/corruption-detection/blob/main/draft- sprang-avtcore-corruption-detection.md. Status information for this document may be found at https://datatracker.ietf.org/doc/draft- sprang-avtcore-corruption-detection/. Discussion of this document takes place on the avtcore WG Working Group mailing list (mailto:avt@ietf.org), which is archived at https://datatracker.ietf.org/wg/avtcore. Subscribe at https://www.ietf.org/mailman/listinfo/avt/. Source for this draft and an issue tracker can be found at https://github.com/sprangerik/corruption-detection. Status of This Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at https://datatracker.ietf.org/drafts/current/. Språng Expires 17 September 2025 [Page 1] Internet-Draft Video Corruption Detection March 2025 Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on 17 September 2025. Copyright Notice Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/ license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Conventions and Definitions . . . . . . . . . . . . . . . . . 3 3. Applicability . . . . . . . . . . . . . . . . . . . . . . . . 4 4. Corruption Detection . . . . . . . . . . . . . . . . . . . . 4 4.1. RTP header extension format . . . . . . . . . . . . . . . 4 4.1.1. Data layout overview . . . . . . . . . . . . . . . . 4 4.1.2. Data layout details . . . . . . . . . . . . . . . . . 5 4.1.3. Synchronization Message . . . . . . . . . . . . . . . 6 4.2. Details of Operation . . . . . . . . . . . . . . . . . . 6 4.2.1. Halton Sequence Handling . . . . . . . . . . . . . . 6 4.2.2. Halton Sequence Algorithm . . . . . . . . . . . . . . 6 4.2.3. Spatial Layer Considerations . . . . . . . . . . . . 8 4.2.4. Sample Filtering . . . . . . . . . . . . . . . . . . 8 4.2.5. Handling of Image Planes . . . . . . . . . . . . . . 9 4.2.6. Allowed Error Thresholds . . . . . . . . . . . . . . 10 4.2.7. Determining Filter Size and Error Thresholds . . . . 10 4.2.8. Calculating A Corruption Score . . . . . . . . . . . 10 5. Security Considerations . . . . . . . . . . . . . . . . . . . 11 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 11 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 11 7.1. Normative References . . . . . . . . . . . . . . . . . . 11 7.2. Informative References . . . . . . . . . . . . . . . . . 11 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 12 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 12 Språng Expires 17 September 2025 [Page 2] Internet-Draft Video Corruption Detection March 2025 1. Introduction The Corruption Detection (sometimes referred to as automatic corruption detection or ACD) extension is intended to be a part of a system that allows estimating a likelihood that a video transmission is in a valid state. That is, the input to the video encoder on the send side corresponds to the output of the video decoder on the receive side with the only difference being the expected distortions from lossy compression. The goal is to be able to detect outright coding errors caused by things such as bugs in encoder/decoders, malformed packetization data, incorrect relay decisions in SFU-type servers, incorrect handling of packet loss/reordering, and so forth. This should be accomplished with a high signal-to-noise ratio while consuming a minimum of resources in terms of bandwidth and/or computation. It should be noted that it is not a goal to be able to e.g. gauge general video quality using this method. This is accomplished in two steps. On the sender side: * Pseudo randomly (using a Halton Sequence) sampling a small set of locations within the input to the encoder. * Applying a gaussian low-pass filter in order to supress encoding distortions. * Determining an "allowed error" threshold, below which samples are not determined as erroneus. Then, on the receiver side: * Sampling the same locations as the sender, but in the output image from the decoder. * Applying the same low-pass filter as the sender * Comparing the values in the samples provided in the header extension with those sample from the output image, subtracting the allowed error from each sample pair. 2. Conventions and Definitions The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here. Språng Expires 17 September 2025 [Page 3] Internet-Draft Video Corruption Detection March 2025 3. Applicability The Corruption Detection extension can be used to validate the correctness of any video pipeline in an end-to-end manner, as long as there is no termination of the bitstream in the process. It is designed to be codec agnostic. In terms of [RFC7667], any Point-to-Point or Transport Translator topologies can be used. The header extension just needs to be propagated together with the media packet until the decode stage. If however a Media Translator is used, the header extension needs to be extracted at the time of decoding there (e.g during transcoding). If corruption detection is desired to the end user, then a new Corruption Detection header extension needs to be created as part of the middlebox encoding step and thet can then be carried forward. 4. Corruption Detection _Name:_ "Corruption Detection"; "Extension for Automatic Detection of Video Corruptions" _Formal name:_ http://www.webrtc.org/experiments/rtp-hdrext/ corruption-detection _Status:_ This extension is defined here to allow for experimentation. Experience with the experiment has shown that it is useful; this draft therefore presents it to the IETF for consideration of whether to standardize it or leave it as a proprietary extension. 4.1. RTP header extension format 4.1.1. Data layout overview Data layout of a corruption-detection header extension, using a One- Byte header (see [RFC5285]) section 4.2: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ID | len=7 |B| seq index | std dev | Y err | UV err| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | sample 0 | sample 1 | ... up to sample <= 12 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Data layout of a corruption-detection header extension, using a Two- Byte header (see [RFC5285]) section 4.3: Språng Expires 17 September 2025 [Page 4] Internet-Draft Video Corruption Detection March 2025 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ID | L=0 | ID | len=7 |B| seq index | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | std dev | Y err | UV err| sample 0 | sample 1 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ... up to sample <= 255 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4.1.2. Data layout details 4.1.2.1. B (1 bit) Whether the sequence number should be interpreted as the MSB or LSB of the full size 14 bit sequence index described in the next point. 4.1.2.2. seq index (7 bits) The index into the Halton sequence (used to locate where the samples should be drawn from). If B is set: the 7 most significant bits of the true index. The 7 least significant bits of the true index shall be interpreted as 0. This is because this is the point where we can guarantee that the sender and receiver has the same full index. B MUST be set on keyframes. On droppable frames B MUST NOT be set. If B is not set: The 7 LSB of the true index. The 7 most significant bits should be inferred based on the most recent message. 4.1.2.3. std dev (8 bits) The standard deviation of the Gaussian filter used to weigh the samples. The value is scaled using a linear map: 0 = 0.0 to 255 = 40.0. A std dev of 0 is interpreted as directly using just the sample value at the desired coordinate, without any weighting. 4.1.2.4. Y err (4 bits) The allowed error for the luma channel. 4.1.2.5. UV err (4 bits) The allowed error for the chroma channels. Språng Expires 17 September 2025 [Page 5] Internet-Draft Video Corruption Detection March 2025 4.1.2.6. Sample N (8 bits) The N:th filtered sample from the input image. Each sample represents a new point in one of the image planes, the plane and coordinates being determined by index into the Halton sequence (starting at seq# index and is incremented by one for each sample). Each sample has gone through a Gaussian filter with the std dev specified above. The samples have been floored to the nearest integer. 4.1.3. Synchronization Message A special case is the so-called "synchronization" message. Such a message only contains the first byte. They are used to keep the sender and receiver in sync even if no "full" message has been received for a while. Such messages MUST NOT be sent on droppable frames. 4.2. Details of Operation 4.2.1. Halton Sequence Handling The sender and receiver need to find the same set of pseudo random sampling locations for an instrumented frame. Sending explicit coordinates for each sample would however use a significant amount of bandwidth, in fact potentially several times the size of the sample values themselves. To avoid this overhead, we instead define that a 2D Halton Sequence is used to generate the sample coordinates - meaning that the clients only need to keep track of the index into that sequence in order to generate the desired sequence of sampling coordinates. The bases used are 2 for the row and 3 for the column. 4.2.2. Halton Sequence Algorithm In pseudo code, the Halton sequence can implemented as follows: Inputs: index i, base b Where b is 2 when calculating the row (y-axis) and 3 when calculating the column (x-axis) Språng Expires 17 September 2025 [Page 6] Internet-Draft Video Corruption Detection March 2025 f = 1; r = 0; while i > 0 do { f = f / b; r = r + f * (i mod b); i = floor(i / b); } return r; // Coordinate component. 4.2.2.1. Sequence Index Counter The sequence index into the Halton sequence is a 14 bit unsigned integer, which wraps around to zero on overflow. Only half of those bits are transmitted with any given message. If B = 1 then the 7 most significant bits are transmitted and the 7 least significant bits are implicitly set to 0. If B = 1 then the 7 least significant bits are transmitted and the most significant bits are inferred for the current state. For keyframes, the sender MUST set the value of B to 1. For droppable frames (e.g. non-baselayer frames when using temporal layers), the sender MUST set B to 0 to prevent sender and receiver from out of sync should that frame be lost. 4.2.2.2. Sequence Index Stepping A sequence index MUST be present on all keyframes of an instrumented video stream, however there are no requirements around which sequence index to start at; the sender may choose that arbitrarily. This means we always start with a known index. For each sample within the corruption detection header extension, the sequence index is assumed to be incremented by one. This allows a variable amount of samples to be present in an extension, if so desired. When droppable frames are used (e.g. when using temporal layering), a middle box may drop some frames resulting in a gap between the last sequence of the previous extension and sequence index indicated by the header of the next extension. If this happens and the new frame has B = 0, then the receiver is intended to just increment the sequence number index until the 7 least significant bits in the index matches the value in the extension header (possibly including an increment of the more significant bits). Språng Expires 17 September 2025 [Page 7] Internet-Draft Video Corruption Detection March 2025 This means that the sender MUST NOT add >= 127 consecutive samples to droppable frames, otherwise the most significant bits may come out of sync. If needed, a "sync message" can be added to guarantee sequence index alignment even if no samples are available. This is done by sending a truncated message containing just B and the sequence index. 4.2.3. Spatial Layer Considerations When multiple spatial layers are present within a single SSRC, the sender MUST produce a separate and independent stream of corruption detection headers for each spatial layer. This ensures that a receiver can decode and verify the highest spatial layer that is part of the stream they are receiving, and any layers culled by a middlebox does not affect the integrity of e.g. the sequence index series for that layer. When using a scalability mode where a higher spatial layer uses inter-layer prediction (prediction between frames belonging to the same temporal unit, e.g. the "SxTx" modes in the W3C [SVC] spec), then the frame should be treated as a key-frame if any frame in the dependency chain within that temporal layer is a key-frame. 4.2.4. Sample Filtering The std dev field of the extension indicates the standard deviation of a 2D Gaussian blur kernel that should be applied taking samples - both to the raw input frame before encoding at the sender and to the raw output frame from the decoder at the receiver. Further, in order to reduce the number of samples the algorithm needs to average - it is optionally allowed to ignore parts of the image where weight from the Gaussian becomes small. In practice, weights below 0.2 have shown to be small enough that they have negligible effect on the end result. Any samples outside the plane are considered to have weight 0. Note that this processing is done on a per-image-plane basis. In pseudo-code, that means we get the following: Språng Expires 17 September 2025 [Page 8] Internet-Draft Video Corruption Detection March 2025 // Translate 8bit header value to floating point. stddev = header.std_dev * (40.0 / 255); // Max number of samples away from the center. max_d = ceil(sqrt(-2.0 * ln(0.2) * stddev^2) - 1; sample_sum = 0; weight_sum = 0; for (y = max(0, row - max_d) to min(plane_height, row + max_d) { for (x = max(0, col - max_d) to min(plane_width, col + max_d) { weight = e^(-1 * ((y - row)^2 + (x - col)^2) / (2 * stddev^2)); sample_sum += SampleAt(x, y) * weight; weight_sum += weight; } } filtered_sample = sample_sum / weight_sum; 4.2.5. Handling of Image Planes For now this header extension is only defined for 4:2:0 chroma subsampling. // TODO: Discuss supporting other formats. In order to translate the row/column calculated from the Halton sequence into a coordinate within a given image plane, visualize the U/V (chroma) planes as being attached to the Y (luma) planes as follows: +------+---+ | | U | + Y +---+ | | V | +------+---+ In pseudo code: row = GetHaltonSequence(seq_index, /*base=*/2) * image_height; col = GetHaltonSequence(seq_index, /*base=*/3) * image_width * 1.5; if (col < image_width) { HandleSample(Y_PLANE, row, col); } else if (row < image_height / 2) { HandleSample(U_PLANE, row, col - image_width); } else { HandleSample(V_PLANE, row - (image_height / 2), col - image_width); } Språng Expires 17 September 2025 [Page 9] Internet-Draft Video Corruption Detection March 2025 4.2.6. Allowed Error Thresholds The header extension contains two fields for an "allowed error"; one for the luma channel and one for the chrome channels. The two values allow for accommodating different spatial resolutions and thus different error magnitudes for the respective image planes. When calculating the difference between two samples in a receiving client (i.e. one locally filtered sample and the corresponding remote sample from the header extension) the absolute magnitude of the error should be reduced by the specified amount. 4.2.7. Determining Filter Size and Error Thresholds The filter size (std dev of the blur kernel) and the allowed error thresholds SHOULD be set such that 99.5% of filtered samples end up with a delta <= the error threshold for that plane, based on a representative set of test clips and bandwidth constraints. This text does not put restrictions on how exactly an implementation should determine appropriate thresholds, but a straightforward way is to look at the QP and codec type as this translates roughly the expected compression distortion. Individual implementations may exhibit better or worse performance than an "average" encoder implementation for a given codec type - so a sender is encouraged to compensate for this if the implementation is far outside the expected bounds based on averages. 4.2.8. Calculating A Corruption Score This text does not put exact requirements on how the differences calculated should be translated into an output. While it is tempting to try and use the 99.5% percentile aspect and use a simple statistical method to estimate the probability of this magnitude of error with a given confidence interval - in practice that has proven tricky due to the number outliers that don't quite follow a normal distribution. In practice, it has been found that just taking the sum of all differences squared and dividing by two yields a reasonably good result. Further improvements in this area should be possible without having to make changes to the message format itself. Språng Expires 17 September 2025 [Page 10] Internet-Draft Video Corruption Detection March 2025 5. Security Considerations The de facto way of encrypting RTP packets is using SRTP as defined in [RFC3711]. Unfortunately, SRTP does not provide encryption for header extensions as they are not seen as part of the payload. That poses a serious security risk, as the corruption detection samples are in essence a very sparse encoding of the source image - so given a sufficient number of samples and a static scene an eavesdropper could rather trivially reconstruct the scene from just header extensions. While technically not required, the sender SHOULD negotiate some form of encryption that covers this header extension. The most straightforward methods is to use either [RFC6904] or [RFC9335], both of which are specifically designed to allow selective encryption of sensitive RTP header extensions. 6. IANA Considerations If the WG decides that this extension should be registered as a standardized extension, IANA is requested to perform the appropriate registration. If the WG decides that this is a private extension, the URL http://www.webrtc.org/experiments/rtp-hdrext/corruption-detection is used to identify the extension. 7. References 7.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, . [RFC5285] Singer, D. and H. Desineni, "A General Mechanism for RTP Header Extensions", RFC 5285, DOI 10.17487/RFC5285, July 2008, . [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, . 7.2. Informative References Språng Expires 17 September 2025 [Page 11] Internet-Draft Video Corruption Detection March 2025 [RFC3711] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K. Norrman, "The Secure Real-time Transport Protocol (SRTP)", RFC 3711, DOI 10.17487/RFC3711, March 2004, . [RFC6904] Lennox, J., "Encryption of Header Extensions in the Secure Real-time Transport Protocol (SRTP)", RFC 6904, DOI 10.17487/RFC6904, April 2013, . [RFC7667] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 7667, DOI 10.17487/RFC7667, November 2015, . [RFC9335] Uberti, J., Jennings, C., and S. Murillo, "Completely Encrypting RTP Header Extensions and Contributing Sources", RFC 9335, DOI 10.17487/RFC9335, January 2023, . [SVC] W3C, "Scalable Video Coding (SVC) Extension for WebRTC", n.d., . Acknowledgments Fanny Linderborg and Emil Vardar for their contributions in implementing and evaluating the corruption detection mechanism. Author's Address Erik Språng Google Email: sprang@google.com Språng Expires 17 September 2025 [Page 12]