Kevin Wheatley: There have been various discussions on the OCIO Slack.
Alex Fry: I've been working on my ICtCp comparison tool, but nothing to show yet.
Carol Payne: I realized Alex is not on the OCIO Slack. I'll invite him.
Cuneyt Ozdas: I work with Doug at Autodesk, and he asked me to look into the Metal issue. I found that in Metal we have a wrapper class, which is absent in OpenGL, where we define a lookup table. This is populated at runtime, probably for every thread, which hits the L1 cache and affects performance. Rémi had a PR which replaced the constant float buffer with a texture lookup, which seems to fix the issue. I also tried pulling the array outside the struct, making it a constant global array. That fixes it too. So it seems the Metal shader compiler is doing something wrong.
Rémi Achard: I guess we should share your findings with the Metal guys at Apple.
Cuneyt Ozdas: I guess this shows most obviously on Apple Silicon, but the fix will probably improve other GPUs too.
Rémi Achard: I have the same issue on my non-M Mac with an AMD GPU.
Carol Payne: Eric's reply suggests that's also happening in HLSL.
Rémi Achard: I think it's another reason in HLSL, where we were using a texture lookup inside the loop. That's why I switched to the constant array in the first place. It only affects some older DirectX versions.
Cuneyt Ozdas: I wondered about using uniform buffers instead of arrays.
Rémi Achard: I did try that. It doesn't change the speed. But it won't work with OpenGL before v2. We could do it for Metal.
Doug Walker: There are limits to how many uniforms you can have.
Rémi Achard: There may be compatibility and limit issues.
Kevin Wheatley: So we certainly need to modify things for Metal and ask Apple if it's a bug or a feature. I would think using an extra texture would be less preferable.
Rémi Achard: That would be better for implementers. Although we already have 2 or 3 textures.
Kevin Wheatley: My branch has 2 textures, but could be reduced to 1 if we sample everything on the same hue. It might make the chroma compressor slower if used independently, but we may not care about that.
Rémi Achard: On my laptop I had an issue if I had too many instances of the transform, the GPU locked up. It doesn't happen if I use a texture.
Kevin Wheatley: If people report problems with that we could look at ways to reuse the same table if there are multiple instances with the same target gamut. That could be the most likely use case, going backward and forward through the same transform. I haven't opened a PR yet. I want to test more first. I made the tone scale go back from achromatic rather than J, as Nick suggested. It makes a difference at least on the CPU by eliminating a pow call. I made the CPU chroma compression use dot products, as the GPU already did. Other than those it's just small scaling factor changes and a rework of the J intersect solve. I didn't try moving where the norm was calculated, which was discussed, moving it to a higher scope. Nor did I look at whether we store hues in degrees, radians or something else, to avoid some trigonometry. Nick mentioned that Doug had suggested we could maybe avoid the polar representation. We don't need the angle. We just need an index for the lookups.
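The table reuse Kevin mentions for multiple instances with the same target gamut could be sketched as a cache keyed by gamut. This is only an illustration, not the OCIO implementation: `get_gamut_table`, the key strings, the table size and the placeholder contents are all hypothetical.

```python
from functools import lru_cache

build_count = 0  # counts how often a table is actually built


@lru_cache(maxsize=None)
def get_gamut_table(gamut_key, size=360):
    """Build the table once per hashable gamut key, then reuse it.

    Hypothetical helper: the real table would hold per-hue gamut data.
    """
    global build_count
    build_count += 1
    # Placeholder contents standing in for the per-hue samples.
    return tuple(0.0 for _ in range(size))


# A forward and an inverse instance with the same target gamut share a table.
forward = get_gamut_table("rec709")
inverse = get_gamut_table("rec709")
# A different target gamut triggers a second build.
other = get_gamut_table("p3")
```

With this arrangement a forward and inverse pair pays for one table build, which matches the use case described of going backward and forward through the same transform.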
Nick Shaw: Since the angle came from the rectangular form in the first place, it makes sense to reuse that rather than recalculate it from the angle. The a and b used in Pekka's fitted curve are just scaled versions of the a and b from the Aab you have calculated on the way to JMh. Sin and cos h are also used later to go back from JMh to XYZ, and maybe if you had held the original a and b you could reuse them there too.
Kevin Wheatley: That's another reason to move the norm calculation to a higher level where a and b are still available.
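The point about reusing a and b can be illustrated with a small sketch: since h = atan2(b, a), cos h and sin h are just a and b divided by their norm, so no trigonometric calls are needed. The function names and the choice of h = 0 for the achromatic case are assumptions made for this example, not taken from the OCIO code.

```python
import math


def hue_trig_from_ab(a, b, eps=1e-12):
    """Return (cos h, sin h) directly from rectangular a, b.

    Avoids atan2/cos/sin by normalising the (a, b) vector.
    """
    norm = math.hypot(a, b)
    if norm < eps:
        # Achromatic: hue is undefined, so h = 0 is assumed here.
        return 1.0, 0.0
    return a / norm, b / norm


def hue_trig_via_angle(a, b):
    """Reference path: go to the angle and back with trig calls."""
    h = math.atan2(b, a)
    return math.cos(h), math.sin(h)
```

Both functions agree to floating-point precision for non-zero (a, b); the first replaces three trigonometric calls with one square root and two divisions, and the norm it computes is the same quantity discussed above as a candidate for moving to a higher scope.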
Carol Payne: So if you need another couple of days before opening a PR that's ok, particularly if you can get that new stuff in, and have GPU/CPU parity, so people can test and profile. You won't have the Metal fix in it, correct? So we'll send it to the Metal guys together with our suggested fix. Then we have to decide if this is good enough.
Kevin Wheatley: I'm hoping for some code feedback from people.
Carol Payne: Then hopefully we can put something out, together with some release notes on what we've done. Then later the CTL can be updated.
Nick Shaw: GPU profiling is not something a general user can do, is it?
Carol Payne: No. But the people who tested before can retest. Something others can do is look at the configs. Those will become part of the next release.
Alex Fry: For my delta tool I wondered where I could get a build of OCIO.
Doug Walker: You can just pip install 2.4.1. That includes ociodisplay, which uses the GPU, and ocioconvert. I also made a combined ACES 1 and 2 config for testing, which is linked from the optimization Wiki.
ACES Output Transforms VWG
Meeting #179, February 5th, 12pm PT
[Meeting Recording]
Attendees
Meeting Notes