
docs(ci): deduplicate doc images before packaging the artifact #4157

Draft
grandixximo wants to merge 1 commit into LinuxCNC:master from grandixximo:docs-image-dedup

Conversation

@grandixximo

grandixximo commented Jun 12, 2026

Contributor

The HTML build copies every referenced image into each language tree, so the built doc tree is ~324MB with ~227MB of duplicate images. This collapses byte-identical images into a shared image/ tree and rewrites the refs, dropping the published tarball's tree to ~118MB.
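
As a rough illustration of the "collapse byte-identical images" idea, the sketch below groups image files by the SHA-256 of their contents and estimates how much space a single shared copy per group would reclaim. It is not the code in dedup-docs.py; the function names and the list of image suffixes are assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumed set of image suffixes; the real script may use a different list.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".svg"}


def group_identical_images(root: Path) -> dict[str, list[Path]]:
    """Map SHA-256 digest -> sorted paths of images with identical bytes."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return dict(groups)


def reclaimable_bytes(groups: dict[str, list[Path]]) -> int:
    """Bytes saved if every group keeps exactly one shared copy."""
    return sum(
        paths[0].stat().st_size * (len(paths) - 1)
        for paths in groups.values()
    )
```

Any group with more than one path is a candidate for a single copy under the shared image/ tree; rewriting the references that point at the removed copies is a separate step not shown here.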

It runs as a CI step in the htmldocs job, after the build and before the tar, so the build itself stays untouched and only the packaged artifact is deduplicated. The pass lives in .github/scripts/dedup-docs.py (per @hdiethelm's suggestion to keep it out of the Makefile).

Verified on a full 8-language build: 227MB of images collapse to 28.8MB (87% reclaimed), the pass self-verifies that every image reference still resolves after rewriting, is idempotent, and leaves the CSS and non-image refs untouched. Languages are read from docs/po4a.cfg, the same source the doc build uses.
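
The self-verification step can be pictured roughly as follows: parse each HTML page, collect the `src` of every `<img>`, and check that each relative reference resolves to a file on disk. This is a sketch under assumptions, not the check in dedup-docs.py: the names `ImgSrcCollector` and `broken_image_refs` are hypothetical, and only relative `<img src>` references are examined.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlsplit


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def broken_image_refs(root: Path) -> list[tuple[Path, str]]:
    """Return (page, src) pairs whose relative image reference is missing."""
    broken = []
    for page in sorted(root.rglob("*.html")):
        collector = ImgSrcCollector()
        collector.feed(page.read_text(encoding="utf-8", errors="replace"))
        for src in collector.sources:
            parts = urlsplit(src)
            # Skip remote URLs, data: URIs, and site-absolute paths;
            # this sketch only checks references relative to the page.
            if parts.scheme or parts.netloc or not parts.path:
                continue
            if parts.path.startswith("/"):
                continue
            target = page.parent / unquote(parts.path)
            if not target.is_file():
                broken.append((page, src))
    return broken
```

An empty result after the rewrite means every checked reference still resolves. Running the dedup a second time and getting an unchanged tree would be the matching idempotency check.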

@BsAtHome

Contributor

> I went with a post-build pass rather than doing this in image_resolver.rb.

If it works as intended without too much extra work, why not? I trust you have weighed the options and gone for the better one :-)

I'll have a look later.

@grandixximo

Contributor Author

Thanks. FWIW I did spend a fair bit of time weighing the build-integrated alternative before settling on the post-build pass.

This PR adds just one file and it is readable, but I agree the reason it has to exist is ugly: the build creates the duplicates and then this cleans them up.

The real alternative is not only refactoring the Ruby resolver but also shifting the build from zip to tar, so the resolver can place symlinks and have them survive. That preserves the dedup all the way through (artifact, deb, fetch), but it is a larger blast radius and would need coordination with @hdiethelm.
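
To illustrate why tar matters for the symlink approach, the sketch below builds a tiny tree with one shared image and one symlink to it, packs it with Python's `tarfile`, and lists the members that were stored as symlinks. The paths (`html/image/logo.png`, `html/de/logo.png`) and function names are invented for the example, and it assumes a platform where `os.symlink` is available, such as a Linux CI runner.

```python
import os
import tarfile
import tempfile
from pathlib import Path


def symlink_members(archive: Path) -> dict[str, str]:
    """Return {member name: link target} for symlinks stored in a tar file."""
    with tarfile.open(archive) as tar:
        return {m.name: m.linkname for m in tar.getmembers() if m.issym()}


def demo() -> dict[str, str]:
    """Pack a hypothetical two-directory doc tree and report its symlinks."""
    with tempfile.TemporaryDirectory() as tmp:
        tree = Path(tmp) / "html"
        (tree / "image").mkdir(parents=True)
        (tree / "de").mkdir()
        (tree / "image" / "logo.png").write_bytes(b"fake image bytes")
        # The language tree points at the shared copy instead of duplicating it.
        os.symlink("../image/logo.png", tree / "de" / "logo.png")

        archive = Path(tmp) / "docs.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            # tarfile does not dereference symlinks by default, so the link
            # itself is stored rather than a second copy of the image.
            tar.add(tree, arcname="html")
        return symlink_members(archive)
```

Calling `demo()` should report `html/de/logo.png` as a symlink to `../image/logo.png`, which is the property the resolver-based approach relies on.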

If we agree on that shape instead, the result is a more elegant structure, and I think genuinely better. Happy to go that way if you and @hdiethelm are on board.

@andypugh

andypugh commented Jun 17, 2026

Collaborator

A bit late to think of this, but wouldn't Python be a good choice for the script to pull down the docs?
(I just checked, we do have Python)

I commented in the wrong thread. I was meaning to use Python on the web server rather than bash / php

@hdiethelm

Contributor

Sorry, I only just saw this. Hmm, does this 300MB hurt anyone? If not, we could just leave it as it is.

The best solution, of course, would be to build the docs so that common images live in a common folder and the HTML links are correct at build time, rather than cleaning up afterwards. But this might be unreasonably complex; I have no clue how this doc build works.

If you want to go with symlinks:

https://github.com/actions/upload-artifact can upload zip or raw files, so I should be able to use tar to create an archive in CI and then upload it as a raw file. Should I try that in #4150?

@hdiethelm

Contributor

I went for .tar.gz anyway, so you are free to do what you need.

If the size is a real issue for the homepage server but nowhere else, an option would be to run your dedup script in CI only for the artifact and not in the Makefile. You can easily add another step before the tar step. If it's CI only, the right place for the script is .github/scripts/, so the tree is not cluttered with CI scripts.

That would be something like:

    - name: Dedup linuxcnc-doc
      run: |
        set -x
        python3 .github/scripts/dedup-docs.py docs/build/html

@grandixximo grandixximo changed the title docs: deduplicate built HTML images into a shared image/ tree docs(ci): deduplicate doc-artifact images into a shared image/ tree Jun 18, 2026
@grandixximo

grandixximo commented Jun 18, 2026

Contributor Author

Went with your CI-only approach. The dedup is now .github/scripts/dedup-docs.py with no Makefile wiring, so the build itself is untouched and only the packaged artifact gets deduped.

On a full 8-language build it takes the tree from ~324MB to ~118MB (227MB of duplicate images down to 28.8MB), self-verifies that every image reference still resolves after the rewrite, and is idempotent.

This PR is only the script. Could you add the invocation before the tar step in #4150?

    - name: Deduplicate doc images
      run: python3 .github/scripts/dedup-docs.py docs/build/html --apply

One ordering note: that step needs the script on master first, else the CI build (your branch merged onto master) cannot find .github/scripts/dedup-docs.py and fails. So merge #4157 first, then rebase here and add the line. Or guard it with [ -f .github/scripts/dedup-docs.py ] to avoid depending on order.

@hdiethelm

Contributor

Working on different branches for the same topic is annoying. It would be best if you rebase on top of my branch and make the CI change yourself. If yours gets merged first, my changes are in and I close mine. If mine gets merged first, you can rebase onto master.

The HTML build copies every referenced image into each language tree, so
the built doc tree is ~324MB with ~227MB of duplicate images. Add a CI step
that collapses byte-identical images into a shared image/ tree and rewrites
the refs before the tar step, dropping the published tarball's tree to
~118MB. The pass lives in .github/scripts/dedup-docs.py, runs only in CI,
and leaves the build itself untouched. It self-verifies that every image
reference still resolves and is idempotent.
@grandixximo grandixximo changed the title docs(ci): deduplicate doc-artifact images into a shared image/ tree docs(ci): deduplicate doc images before packaging the artifact Jun 18, 2026
@grandixximo

Contributor Author

@hdiethelm since #4150 merged, the tar step is on master, so I added the dedup call to the CI in this PR instead of asking you to wire it. It runs in the htmldocs job after the build and before the tar, so only the packaged artifact is deduplicated and the build itself stays untouched.

@BsAtHome

Contributor

Deduplication after the fact is wasteful. First you create a lot of copies just to do a lot of work to get rid of them again. That is working backwards and fixing the symptoms of a "maybe-not-so-good" choice that caused the problem.

The better approach, IMO, is not to create the duplicates in the first place. You already have to traverse all the links for the languages to match a translated image and use it instead of the generic/English version. I have a hard time understanding why this extra work after the fact is better than prevention?

@grandixximo grandixximo marked this pull request as draft June 18, 2026 14:15
@grandixximo

Contributor Author

It will be a bigger PR with a larger blast radius; I was settling for good enough with Hannes's proposal...
I will reshape it as we discussed before. We should be OK with symlinks, right?
