
Can't use spatial_hash with multiple physics engines

Posted: Mon May 30, 2022 1:51 am
by jharwell
Hello,

ARGoS crashes whenever I try to use the spatial_hash capability of the dynamics2d engine with multiple engines. If I configure two physics engines (each covering half the arena) and 2 threads, ARGoS always crashes. If I instead set ARGoS to use only 1 thread (keeping multiple physics engines), the problem vanishes, so I suspect there is some global state which is not being protected by a lock. I also saw that using multiple physics engines where exactly ONE engine uses the spatial hash does not cause a crash, which reinforced my suspicion that chipmunk uses global state for its hashing.

I've tried a range of (cell_size, cell_num) values, from (0.5, 1000) to (500, 10000), with the same result. I'm simulating a swarm of ~1000 foot-bots, using 16 physics engines (equally dividing a square arena).
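
For concreteness, the crashing setup looks roughly like this in the .argos file (a sketch from memory, so the exact placement of the spatial_hash tag may not match your ARGoS version; cell_size/cell_num are the two parameters I varied above):

Code: Select all

<!-- Sketch only: 2 threads + 2 engines is the minimal crashing setup described above -->
<framework>
  <system threads="2" />
</framework>
<physics_engines>
  <dynamics2d id="dyn2d0">
    <!-- boundaries assigning this engine the left half of the arena omitted -->
    <spatial_hash cell_size="0.5" cell_num="1000" />
  </dynamics2d>
  <dynamics2d id="dyn2d1">
    <!-- boundaries assigning this engine the right half of the arena omitted -->
    <spatial_hash cell_size="0.5" cell_num="1000" />
  </dynamics2d>
</physics_engines>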

A backtrace from gdb is as follows:

Code: Select all

>>> bt
#0  0x00007ffff27c748b in cpHandleRelease (hand=0x0, pooledHandles=0x5555556fd840) at /opt/jharwell/git/argos3/src/plugins/simulator/physics_engines/dynamics2d/chipmunk-physics/src/cpSpaceHash.c:71
#1  0x00007ffff27c7658 in clearTableCell (hash=0x5555556fa6d0, idx=1358) at /opt/jharwell/git/argos3/src/plugins/simulator/physics_engines/dynamics2d/chipmunk-physics/src/cpSpaceHash.c:118
#2  0x00007ffff27c76c3 in clearTable (hash=0x5555556fa6d0) at /opt/jharwell/git/argos3/src/plugins/simulator/physics_engines/dynamics2d/chipmunk-physics/src/cpSpaceHash.c:130
#3  0x00007ffff27c8411 in cpSpaceHashReindexQuery (hash=0x5555556fa6d0, func=0x7ffff27b34e5 <collideShapes>, data=0x5555556f6ba0) at /opt/jharwell/git/argos3/src/plugins/simulator/physics_engines/dynamics2d/chipmunk-physics/src/cpSpaceHash.c:464
#4  0x00007ffff27b2bdb in cpSpatialIndexReindexQuery (index=0x5555556fa6d0, func=0x7ffff27b34e5 <collideShapes>, data=0x5555556f6ba0) at /opt/jharwell/git/argos3/src/plugins/simulator/physics_engines/dynamics2d/chipmunk-physics/include/cpSpatialIndex.h:233
#5  0x00007ffff27b3b29 in cpSpaceStep (space=0x5555556f6ba0, dt=0.02) at /opt/jharwell/git/argos3/src/plugins/simulator/physics_engines/dynamics2d/chipmunk-physics/src/cpSpaceStep.c:374
#6  0x00007ffff27d48d3 in argos::CDynamics2DEngine::Update (this=0x5555556f6750) at /opt/jharwell/git/argos3/src/plugins/simulator/physics_engines/dynamics2d/dynamics2d_engine.cpp:116
#7  0x00007ffff7f71b23 in argos::CSpaceMultiThreadBalanceQuantity::UpdateThreadPhysics (this=0x5555555e1f90, c_range=...) at /opt/jharwell/git/argos3/src/core/simulator/space/space_multi_thread_balance_quantity.cpp:377
#8  0x00007ffff7f7173d in argos::CSpaceMultiThreadBalanceQuantity::UpdateThread (this=0x5555555e1f90, un_id=13) at /opt/jharwell/git/argos3/src/core/simulator/space/space_multi_thread_balance_quantity.cpp:322
#9  0x00007ffff7f70683 in argos::LaunchUpdateThreadBalanceQuantity (p_data=0x5555593b8540) at /opt/jharwell/git/argos3/src/core/simulator/space/space_multi_thread_balance_quantity.cpp:44
#10 0x00007ffff6eec609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#11 0x00007ffff7a9e293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Any ideas what the issue could be?

Re: Can't use spatial_hash with multiple physics engines

Posted: Mon May 30, 2022 3:34 am
by pincy
That's an odd bug indeed. I have never encountered it, but I can have a look at it after June 7th. Maybe we can set up a Zoom chat and discuss it, because it's an important problem.

Re: Can't use spatial_hash with multiple physics engines

Posted: Mon May 30, 2022 2:24 pm
by jharwell
Sure, PM me with some times that work for you and we can find a time to chat.

Re: Can't use spatial_hash with multiple physics engines

Posted: Mon May 30, 2022 7:57 pm
by jharwell
From the chipmunk author, at https://chipmunk-physics.net/forum/viewtopic.php?t=2338:
Chipmunk doesn't have any global shared state. You can safely run separate Chipmunk spaces in different threads.

It's not thread safe however. If you want to access a single space from multiple threads, you'll need to use mutexes.
After some digging, I found that the cpSpaceSegmentQuery() call in dynamics2d_engine.cpp can be made by multiple robots simultaneously as they update their light/proximity sensors in parallel, and this is the source of the issue. Placing a std::scoped_lock right before that call fixes the crash, but as a result ARGoS is MUCH slower with lots of robots.
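
The workaround looks roughly like this (a sketch, not the actual patch: the wrapper and mutex names are mine, and I'm assuming the Chipmunk 6-era cpSpaceSegmentQuery() signature that ARGoS bundles):

Code: Select all

#include <mutex>

#include <chipmunk.h> /* the copy bundled with ARGoS; include path depends on your build */

/* Serializes all segment queries against a shared cpSpace. In a real
   fix this would be a per-engine member of CDynamics2DEngine, since
   only queries against the SAME space need to be serialized. */
static std::mutex g_cSegmentQueryMutex;

static void LockedSegmentQuery(cpSpace* pt_space,
                               cpVect t_start,
                               cpVect t_end,
                               cpLayers t_layers,
                               cpGroup t_group,
                               cpSpaceSegmentQueryFunc t_func,
                               void* pt_data) {
   /* cpSpaceSegmentQuery() mutates the spatial hash internally (see
      below), so concurrent calls on one space race with each other. */
   std::scoped_lock<std::mutex> cGuard(g_cSegmentQueryMutex);
   cpSpaceSegmentQuery(pt_space, t_start, t_end, t_layers, t_group, t_func, pt_data);
}

Even with one mutex per engine rather than a global one, all sensor queries within an engine are still serialized, which is where the slowdown with many robots comes from.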

Digging into the chipmunk source, what seems to be happening is that cpSpaceSegmentQuery() always performs some kind of hash table reindexing, leading to a race condition when multiple robots (each updated in a different thread) perform queries in parallel. I don't know why this is the case; I would not have expected a query to update chipmunk data structures. It looks like when the foot-bot dynamics2d physics model is deleted (because the foot-bot is transferred between engines), the corresponding shapes/body are removed from the chipmunk space as well. BUT removing the shapes/body from the space does not (I think) remove them from the hash index. That cleanup seems to happen after every call to cpSpaceSegmentQuery() when spatial hashing is used (amortizing its cost, I guess): at the end of each call, the cpSegmentQuery_helper() function calls remove_orphaned_handles() to clean up any entities which are no longer in the space but are still in the hash index.
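
For reference, the cleanup in question looks roughly like this (paraphrased from memory of cpSpaceHash.c, not a verbatim copy, and it relies on types private to that file):

Code: Select all

/* Paraphrase of remove_orphaned_handles() in chipmunk's cpSpaceHash.c:
   handles whose object has been removed from the space are unlinked
   and released DURING a query, i.e. queries write to the hash bins. */
static inline void remove_orphaned_handles(cpSpaceHash* hash, cpSpaceHashBin** bin_ptr) {
   cpSpaceHashBin* bin = *bin_ptr;
   while (bin) {
      cpHandle* hand = bin->handle;
      cpSpaceHashBin* next = bin->next;

      if (!hand->obj) {
         /* Orphaned handle: unlink the bin and release the handle. Two
            threads racing here corrupt the bin list, consistent with
            cpHandleRelease(hand=0x0, ...) in the backtrace above. */
         (*bin_ptr) = next;
         recycleBin(hash, bin);
         cpHandleRelease(hand, hash->pooledHandles);
      } else {
         bin_ptr = &bin->next;
      }
      bin = next;
   }
}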

ARGoS calls cpSpaceRemoveShape() for each shape attached to a body when the dynamics2d object model is deleted, which is all I would have thought necessary to purge removed shapes from the space AND the hash index. So I think either we are missing a way to force-purge a removed body/set of shapes from the hash index, or chipmunk doesn't provide one, in which case we fundamentally can't use spatial hashing with multiple dynamics2d engines in ARGoS.
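
If chipmunk's internals are fair game, the force-purge I have in mind would look something like this (purely hypothetical and untested: it assumes that reindexing the hash releases orphaned handles eagerly, which is my reading of the clearTable() frames in the backtrace, and it needs chipmunk_private.h since the space's spatial index fields are not public):

Code: Select all

#include <stddef.h>

#include <chipmunk.h>         /* bundled with ARGoS */
#include <chipmunk_private.h> /* for access to the space's spatial indices */

/* Hypothetical eager purge when a robot's physics model is deleted:
   after removing the shapes/body (which ARGoS already does), force a
   rehash so orphaned handles are released NOW, in the serial part of
   the update, rather than during the next (parallel) sensor query. */
void PurgeRemovedModel(cpSpace* pt_space, cpBody* pt_body,
                       cpShape** ppt_shapes, size_t un_num_shapes) {
   for (size_t i = 0; i < un_num_shapes; ++i) {
      cpSpaceRemoveShape(pt_space, ppt_shapes[i]);
   }
   cpSpaceRemoveBody(pt_space, pt_body);
   /* clearTable() runs during reindexing and releases orphaned handles */
   cpSpatialIndexReindex(pt_space->CP_PRIVATE(activeShapes));
}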

There isn't really anything in the chipmunk docs about how the spatial hash works or when reindexing is triggered, and the comments in the code are minimal, so I'm mostly guessing here. Hopefully this is helpful when you have time to take a look at this.