Optimizing Image Processing with Neon Intrinsics Document ID: 101964_0300_01_en
Version 3.0
libTIFF optimization: CMYK to RGBA conversion
Neon-optimized implementation
The Neon-optimized implementation uses intrinsics to perform calculations for multiple pixels
simultaneously.
As with the other libTIFF optimizations in this guide, the full version of the code
uses an if statement to check whether to use the original code or the Neon code.
For example, we only optimize the case where skewing is not required, that is,
when the values of toskew and fromskew are both zero. For clarity, the if statement
is not shown here.
// Loop over all pixels of the image
uint32_t np = w * h;
uint32_t* endp = cp + np;
// Indices for VTBL that duplicate each pixel's K value
uint8x8_t dupK1 = vcreate_u8(0xff06ff06ff06ff06ull);
uint8x8_t dupK2 = vcreate_u8(0xff0eff0eff0eff0eull);
uint8x16_t kindices = vcombine_u8(dupK1, dupK2);
// Indices to obtain the final results
uint8_t resultIndices[16] = {0,1,2,-1,4,5,6,-1,8,9,10,-1,12,13,14,-1};
while(cp < endp) {
// 16 copies of 255
uint8x16_t v255 = vdupq_n_u8 (255);
// load 4 pixels (each pixel is 4 bytes with the CMYK values)
uint8x16_t src_u8 = vld1q_u8(pp);
// perform (255 - x) on each component
// each vsubl is working on 2 pixels
uint16x8_t subl = vsubl_u8(vget_low_u8(v255), vget_low_u8(src_u8));
uint16x8_t subh = vsubl_high_u8(v255, src_u8);
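// duplicate the (255 - k) element of each pixel across that pixel's four lanes
// (the table lookup works on bytes, so reinterpret to and from uint8x16_t)
uint16x8_t kl = vreinterpretq_u16_u8(vqtbl1q_u8(vreinterpretq_u8_u16(subl), kindices));
uint16x8_t kh = vreinterpretq_u16_u8(vqtbl1q_u8(vreinterpretq_u8_u16(subh), kindices));
// multiply (255 - x) by (255 - k)
uint16x8_t ml = vmulq_u16(kl, subl);
uint16x8_t mh = vmulq_u16(kh, subh);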
// load the indices used to build the final RGBA pixels
uint8x16_t idx = vld1q_u8 (resultIndices);
// divide by 255 so that each result fits in the low 8 bits of its uint16 element
// (the / operator on Neon vector types is a compiler extension, available in GCC and Clang)
uint16x8_t resultl = ml / 255;
uint16x8_t resulth = mh / 255;
// combine the two result vectors, throwing away the upper half of every uint16
uint8x16_t packed = vuzp1q_u8 (vreinterpretq_u8_u16 (resultl),
                               vreinterpretq_u8_u16 (resulth));
// wherever the index is -1, we take the value from v255 (we return 255 in alpha)
uint8x16_t pixels = vqtbx1q_u8 (v255, packed, idx);
// store the four RGBA pixels and advance the pointers/counters
vst1q_u8((uint8_t*)cp, pixels);
cp += 4;
pp += samplesperpixel * 4;
}
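To make the arithmetic in the listing above easier to check, the following is a minimal portable C sketch. It is not part of libTIFF or of this guide's code. All function names (cmyk_to_rgba_scalar, tbl16, tbx16, cmyk_to_rgba_model4, self_check) are hypothetical, and the byte-level view of the 16-bit lanes assumes a little-endian host. The sketch models the table lookups with plain loops: an out-of-range index yields 0 for TBL and leaves the destination byte unchanged for TBX. It then confirms that the four-pixel sequence of steps gives the same bytes as a per-pixel scalar formula.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference for one pixel: r = (255 - c) * (255 - k) / 255, alpha = 255. */
static void cmyk_to_rgba_scalar(const uint8_t cmyk[4], uint8_t rgba[4])
{
    unsigned k = 255u - cmyk[3];
    for (int i = 0; i < 3; i++)
        rgba[i] = (uint8_t)(((255u - cmyk[i]) * k) / 255u);
    rgba[3] = 255;
}

/* Model of a 16-byte TBL lookup: an out-of-range index produces 0. */
static void tbl16(const uint8_t table[16], const uint8_t idx[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = idx[i] < 16 ? table[idx[i]] : 0;
}

/* Model of a 16-byte TBX lookup: an out-of-range index keeps out[i] as it was. */
static void tbx16(uint8_t out[16], const uint8_t table[16], const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        if (idx[i] < 16)
            out[i] = table[idx[i]];
}

/* Four pixels at once, following the same steps as the Neon listing. */
static void cmyk_to_rgba_model4(const uint8_t src[16], uint8_t dst[16])
{
    /* Same byte patterns as kindices and resultIndices (0xff plays the role of -1). */
    static const uint8_t kidx[16] = {6, 0xff, 6, 0xff, 6, 0xff, 6, 0xff,
                                     14, 0xff, 14, 0xff, 14, 0xff, 14, 0xff};
    static const uint8_t ridx[16] = {0, 1, 2, 0xff, 4, 5, 6, 0xff,
                                     8, 9, 10, 0xff, 12, 13, 14, 0xff};
    uint8_t packed[16];

    for (int half = 0; half < 2; half++) {        /* low two pixels, then high two */
        uint16_t sub[8], kdup[8];
        uint8_t sub_bytes[16], k_bytes[16];

        for (int i = 0; i < 8; i++)               /* widening (255 - x) */
            sub[i] = (uint16_t)(255 - src[half * 8 + i]);

        memcpy(sub_bytes, sub, 16);               /* byte view; little-endian assumed */
        tbl16(sub_bytes, kidx, k_bytes);          /* duplicate (255 - k) per pixel */
        memcpy(kdup, k_bytes, 16);

        for (int i = 0; i < 8; i++)               /* multiply, divide, keep low byte */
            packed[half * 8 + i] = (uint8_t)((sub[i] * kdup[i]) / 255);
    }

    memset(dst, 255, 16);                         /* start from 16 copies of 255 */
    tbx16(dst, packed, ridx);                     /* alpha bytes stay at 255 */
}

/* Compare the four-pixel model against the scalar reference on generated inputs. */
static void self_check(void)
{
    for (unsigned seed = 0; seed < 256; seed++) {
        uint8_t src[16], model[16], ref[16];
        for (unsigned i = 0; i < 16; i++)
            src[i] = (uint8_t)(seed * 37u + i * 59u);
        cmyk_to_rgba_model4(src, model);
        for (int p = 0; p < 4; p++)
            cmyk_to_rgba_scalar(src + 4 * p, ref + 4 * p);
        assert(memcmp(model, ref, 16) == 0);
    }
}
```

Calling self_check() from a test harness exercises both paths. Because (255 - x) * (255 - k) is at most 65025, the product fits in a 16-bit lane, and the quotient after dividing by 255 fits in the low byte. This is why the listing can keep only the even-numbered bytes when it packs the results.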
The following table shows more information about the intrinsics in this example:
Table: Intrinsics used in the CMYK to RGBA example
Intrinsic      Description
vcombine_u8    Join two smaller vectors into a single larger vector
Copyright © 2019–2021, 2023–2024 Arm Limited (or its affiliates). All rights reserved.