Removing all robots from simulation causes crash

**jharwell** » Wed Jan 08, 2020 8:56 pm

Hi Carlo,

I'm now able to permanently remove robots at will using RemoveEntity(), as long as at least one robot is left in the simulation after removal. However, if I remove the last robot, I get a segmentation fault from ARGoS on line 347 of space_multi_thread_balance_quantity.cpp whenever n_threads is >= 1 (if I omit the attribute altogether, I don't get an error).

m_vecControllableEntities has size 0 (as expected, since there are now no robots in the simulation), but cEntityRange still has size 1. I think this is caused by m_bIsControllableEntityAssignmentRecalculationNeeded not being set somewhere it needs to be in order to correctly update the robots assigned to a given thread. I'm not sure where, though, because it is already set in RemoveControllableEntity(), which is where I would expect it to go.

Do you have any idea why this might happen? Is it related to ARGoS not starting unless there is at least one robot specified in the input file?

Thanks!

- John

**pincy** » Wed Jan 08, 2020 9:32 pm

I don't think I designed ARGoS to have zero robots in the simulation. I'll have a look at it.

**pincy** » Fri Jan 10, 2020 6:56 am

I just tested a simulation that starts with zero entities in it. It works perfectly, with any number of threads and any scheduling method. I used diffusion_1.argos as a starting point and simply commented out all the entities.

I also tried to start a simulation with about 70 robots and remove them all with RemoveEntity(). For this, I took the custom distributions example and simply added a PostStep() method that removes all the robots after 5 steps. Again, with any configuration of threads I tried, everything works perfectly.

How can I reproduce the crash you're encountering?

**jharwell** » Fri Jan 10, 2020 6:04 pm

I'm not sure--I'll try to come up with a minimal example that reproduces the crash I've been seeing.

**jharwell** » Tue Jan 14, 2020 6:42 pm

So I actually found the cause of this error while working to improve swarm iteration efficiency in the loop functions, and I have included the fix in my pull request: https://github.com/ilpincy/argos3/pull/124.

The issue was that, while removing an entity marked the entity range assigned to each thread as needing recalculation, that recalculation was only performed before the Act phase. Changing the swarm size in the loop functions' PreStep()/PostStep() could therefore (under certain conditions) cause a SenseControl phase thread to access the controller of a robot that had already been removed: that phase runs AFTER the Act phase in the update loop for a single timestep, and recalculation would not occur until the start of the Act phase in the NEXT timestep. That stale access caused the segfault I saw.
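To make the failure mode concrete, here is a minimal toy model of a per-thread entity range going stale after a removal. It is only a sketch: `ToySpace`, `RecalculateRanges()`, `RemoveLastRobot()` and `HasStaleRange()` are hypothetical names invented for illustration and are not the ARGoS classes or methods. Instead of dereferencing a dangling controller, it reports the mismatch with a bounds check.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open index range [first, second) into the list of robots.
using Range = std::pair<std::size_t, std::size_t>;

// Toy stand-in for the simulated space (hypothetical, NOT the ARGoS class).
struct ToySpace {
  std::vector<int> robots;           // ids of robots still in the simulation
  std::vector<Range> thread_ranges;  // slice of `robots` owned by each thread
  bool ranges_dirty = true;          // set whenever the swarm size changes

  ToySpace(std::size_t n_robots, std::size_t n_threads)
      : thread_ranges(n_threads) {
    assert(n_threads > 0);
    for (std::size_t i = 0; i < n_robots; ++i) {
      robots.push_back(static_cast<int>(i));
    }
  }

  // Split the current robots as evenly as possible across the threads.
  void RecalculateRanges() {
    const std::size_t n_threads = thread_ranges.size();
    const std::size_t base = robots.size() / n_threads;
    const std::size_t extra = robots.size() % n_threads;
    std::size_t start = 0;
    for (std::size_t t = 0; t < n_threads; ++t) {
      const std::size_t len = base + (t < extra ? 1 : 0);
      thread_ranges[t] = {start, start + len};
      start += len;
    }
    ranges_dirty = false;
  }

  // Removal only MARKS the ranges as stale; it does not repair them.
  void RemoveLastRobot() {
    assert(!robots.empty());
    robots.pop_back();
    ranges_dirty = true;
  }

  // True if some thread's range reaches past the end of `robots`, i.e. a
  // phase executed now would touch a robot that no longer exists.
  bool HasStaleRange() const {
    for (const Range& r : thread_ranges) {
      if (r.second > robots.size()) return true;
    }
    return false;
  }
};
```

With one robot and one thread, removing the robot leaves the robot list empty while the thread's range still covers one element, which mirrors the size 0 versus size 1 mismatch reported at the start of the thread. Calling `RecalculateRanges()` before the next phase clears it.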

**pincy** » Tue Jan 14, 2020 7:01 pm

I see the potential bug now - the simplest solution would be to trigger a recalculation just before the SenseControl phase. I have a hard time understanding the rest of the code you wrote. What does it accomplish? Can you write a simple example of what you want to achieve with that pull request?
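The suggested fix can be sketched as a "recalculate if dirty" check placed in front of every phase rather than only in front of Act. The class below is a toy under stated assumptions: `ToyUpdateLoop`, `RecalcPolicy` and all member names are hypothetical, the swarm is reduced to a robot count, and the real ARGoS update loop and the pull request may be organized differently.

```cpp
#include <cassert>
#include <cstddef>

// When the per-thread assignment is refreshed (hypothetical policy names).
enum class RecalcPolicy { kBeforeActOnly, kBeforeEveryPhase };

// Toy update loop: tracks how many robots exist and how many the worker
// threads currently believe they own. NOT the ARGoS implementation.
class ToyUpdateLoop {
 public:
  ToyUpdateLoop(std::size_t n_robots, RecalcPolicy policy)
      : n_robots_(n_robots), policy_(policy) {}

  // Removing robots marks the assignment as needing recalculation.
  void RemoveRobots(std::size_t n) {
    assert(n <= n_robots_);
    n_robots_ -= n;
    dirty_ = true;
  }

  // Act always refreshes the assignment first. Returns true if the phase
  // only touches robots that still exist.
  bool Act() {
    RecalculateIfNeeded();
    return AssignmentIsValid();
  }

  // SenseControl refreshes only under the "every phase" policy, so under
  // the "Act only" policy it can run against a stale assignment.
  bool SenseControl() {
    if (policy_ == RecalcPolicy::kBeforeEveryPhase) RecalculateIfNeeded();
    return AssignmentIsValid();
  }

 private:
  void RecalculateIfNeeded() {
    if (!dirty_) return;
    n_assigned_ = n_robots_;
    dirty_ = false;
  }

  bool AssignmentIsValid() const { return n_assigned_ <= n_robots_; }

  std::size_t n_robots_;
  std::size_t n_assigned_ = 0;
  bool dirty_ = true;
  RecalcPolicy policy_;
};
```

If robots are removed after `Act()` has run, `SenseControl()` returns false under `kBeforeActOnly` and true under `kBeforeEveryPhase`. Because the check is guarded by the dirty flag, it costs a single branch on timesteps where the swarm size did not change.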

**jharwell** » Wed Jan 15, 2020 5:57 pm

Sure: here is the code for using this capability in the loop functions PostStep():

Code: Select all

  auto cb = [&](argos::CControllableEntity* robot) {
    robot_post_step(dynamic_cast<argos::CFootBotEntity&>(robot->GetParent()));
    caches_recreation_task_counts_collect(
        &static_cast<controller::base_controller&>(robot->GetController()));
  };
  IterateOverControllableEntities(cb);

Where robot_post_step() takes the controller associated with the controllable entity and does the following:

Code: Select all

  /*
   * Watch the robot interact with its environment after physics have been
   * updated and its controller has run.
   */
  auto iadaptor =
      robot_interactor_adaptor<robot_arena_interactor, interactor_status>(
          controller, rtypes::timestep(GetSpace().GetSimulationClock()));
  auto status =
      boost::apply_visitor(iadaptor,
                           m_interactor_map->at(controller->type_index()));

  /*
   * Collect metrics from robot, now that it has finished interacting with the
   * environment and no more changes to its state will occur this timestep.
   */
  auto madaptor =
      robot_metric_extractor_adaptor<depth1_metrics_aggregator>(controller);
  boost::apply_visitor(madaptor,
                       m_metric_extractor_map->at(controller->type_index()));
  controller->block_manip_collator()->reset();

Basically, I use this functionality to iterate over the swarm and (1) collect metrics from each robot (what task it is currently executing, its current location and heading, collision avoidance status, etc.), (2) have each robot interact with the environment by sending it events related to block pickup/drop and cache pickup/drop, and (3) get information about the task each robot is currently executing, which is used to determine whether the loop functions should recreate an intermediate drop site between the food source and the nest (a cache) after it has been depleted by the swarm. By using this functionality, I can avoid having to maintain my own thread pool for these iteration operations, which need to happen every timestep and are very slow to do without threads for large swarms (for small swarms, serial iteration is fine).

I started out using OpenMP to do the swarm iteration, which worked OK. But on a 16 core machine simulating a large swarm (say 16,000 robots), running ARGoS with 16 threads, I used an additional 16 OpenMP threads to do the iteration, which meant that the OS had to switch 32 threads in and out each timestep, which accrued non-negligible overhead, in addition to the overhead of the OpenMP thread scheduling algorithm itself. Maintaining my own pthread pool still accrues the OS context switching overhead (32 threads on a 16 core machine in the above example), but is faster than the OpenMP implementation. Using the thread pool in ARGoS results in 20-25% improved efficiency over the other two implementation options, which for the 24 hour cluster jobs I run is a significant 5-6 hour savings.
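The numbers above can be checked with a couple of helper functions. This is a back-of-the-envelope sketch only: `ThreadsPerCore()` and `HoursSaved()` are hypothetical helpers, and it assumes that "20-25% improved efficiency" means the job finishes 20-25% sooner in wall-clock time.

```cpp
#include <cassert>

// Average number of runnable threads per core when the simulator and a
// separate iteration pool (OpenMP or pthreads) each keep their own threads.
inline double ThreadsPerCore(unsigned sim_threads, unsigned pool_threads,
                             unsigned cores) {
  assert(cores > 0);
  return static_cast<double>(sim_threads + pool_threads) / cores;
}

// Wall-clock hours saved on a job, ASSUMING the improvement fraction
// translates directly into a shorter run time.
inline double HoursSaved(double job_hours, double improvement_fraction) {
  assert(improvement_fraction >= 0.0 && improvement_fraction <= 1.0);
  return job_hours * improvement_fraction;
}
```

For the 16 core example, `ThreadsPerCore(16, 16, 16)` gives 2 threads per core with a separate pool, versus 1 when the ARGoS threads are reused. `HoursSaved(24.0, 0.20)` and `HoursSaved(24.0, 0.25)` give roughly 4.8 and 6 hours, in line with the 5-6 hours quoted.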
