Windows Wheels: Packaging Gotchas¶
Notes captured during the GH-40 Windows-wheels-to-PyPI work. Recorded here so future maintainers don't re-walk the same multi-day diagnosis paths.
The Windows build differs from Linux/macOS in three ways that interact badly with anything resembling a plugin architecture (and tesseract is a plugin architecture top-to-bottom). Each one fails silently at packaging time and surfaces only at runtime, with an error message that points away from the real cause.
Plugin DLL transitive dependencies¶
delvewheel needs --analyze-existing to patch plugin DLLs
delvewheel repair walks the import tables of .pyd extension modules
by default — nothing else. Plugin factory DLLs (tesseract_*_factories.dll,
tesseract_kinematics_opw_factory.dll, …) that the build pipeline copies
into the wheel's package dir before the repair step do get included
in the wheel, but their import tables stay unpatched and any deps they
pull in beyond what the .pyd files already need go un-bundled.
The fix is delvewheel repair --analyze-existing … — that flag tells
delvewheel to scan and rewrite every DLL in the wheel, not just .pyd.
Without it, the wheel builds, installs, smoke-imports, and then fails at
first plugin instantiation. The misleading part: every test that touches
a plugin (kinematics, collision, task-composer pipelines) fails with
"Failed to load symbol", which sends the diagnosis chasing
__declspec(dllexport) and CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS. Those
aren't the bug.
Runtime DLL search path¶
os.add_dll_directory does not cover plain LoadLibrary from C++
os.add_dll_directory() (which delvewheel auto-injects into __init__.py
for the <project-name>.libs/ directory) only affects LoadLibrary calls
that pass LOAD_LIBRARY_SEARCH_USER_DIRS. CPython's .pyd loader passes
that flag, so add_dll_directory "works" for module imports — which
builds false confidence that it covers the whole process.
Once a .pyd is loaded and its C++ code calls LoadLibrary("plugin.dll")
itself — which is exactly what boost::dll::shared_library does, and
therefore what boost_plugin_loader does — that C++ call uses the
legacy Windows DLL search order, which respects PATH but ignores
add_dll_directory. The fix is to prepend the package dir and
<project-name>.libs/ to PATH in __init__.py before any submodule
import. See src/tesseract_robotics/__init__.py for the canonical block.
Note the directory naming asymmetry that bites here: delvewheel uses the
project's metadata name (tesseract-robotics-nanobind) for the libs
dir, not the Python package name (tesseract_robotics). It is named
tesseract_robotics_nanobind.libs/ on disk.
Diagnosing plugin-load failures¶
boost_plugin_loader's 'Failed to load symbol' is two errors in a trench coat
boost::dll-backed plugin loaders log
Failed to load symbol 'X' in library: Y in two distinct cases that look
identical from Python:
LoadLibrarysucceeded butGetProcAddressreturnedNULL— the symbol genuinely isn't in the DLL's export table.LoadLibraryreturnedNULLbecause a transitive dependency couldn't be resolved — the loader catches the boost exception and re-reports it as "Failed to load symbol".
On Windows, case 2 is far more common than case 1 and points the
investigation in a completely different direction (DLL search path,
bundled deps, delvewheel --analyze-existing) than case 1 would
(__declspec(dllexport), symbol visibility flags). Always confirm with
dumpbin /EXPORTS plugin.dll whether the symbol is actually present
before assuming an export issue. The CI workflow's
verify plugin factory exports step runs this check automatically and
fails fast if the export table is the problem.
Cereal polymorphic registration¶
Cereal's polymorphic registry is per-DLL on Windows MSVC — consumers must re-register types in their own TU
cereal stores its polymorphic-type registry in class-template statics
(StaticObject<InputBindingMap<Archive>>,
StaticObject<OutputBindingMap<Archive>>, ...). Linux and macOS linkers
dedupe these template instantiations across shared libraries, giving the
process one shared registry. MSVC does not — each DLL gets its own
copy of every template static, with no automatic dllexport/dllimport
plumbing in cereal headers to share them.
Consequence: an upstream library (e.g. tesseract_command_language.dll)
that calls CEREAL_REGISTER_TYPE populates its own registry only.
A downstream .pyd that does serialization reads from its own empty
registry and fails at runtime with
RuntimeError: Trying to save an unregistered polymorphic type. The
type really is registered upstream — just not in the registry the
consumer actually consults.
CEREAL_FORCE_DYNAMIC_INIT is not enough on its own. It anchors
upstream's registration TU to keep the linker from stripping it, but
doesn't bridge the registry gap on the consumer side.
The actual Windows fix is to mirror upstream's registration block in
the consumer TU. For tesseract_command_language types
(MoveInstructionPoly, CompositeInstruction, etc.) that means copying
every CEREAL_REGISTER_TYPE and CEREAL_REGISTER_POLYMORPHIC_RELATION
line from upstream
tesseract_command_language/src/cereal_serialization.cpp into our
binding. Re-registration is a no-op on Linux/macOS (cereal tolerates
duplicate registrations of the same {type, archive} pair) and
populates the consumer's per-DLL registry on Windows. See
src/tesseract_serialization/tesseract_serialization_bindings.cpp for
the canonical placement; keep CEREAL_FORCE_DYNAMIC_INIT(LibName) as
belt-and-suspenders on the producer side.