Hanging issue with parallelisation

dmb
Posts: 14
Joined: Wed Jun 26, 2019 11:25 pm

Postby dmb » Sun Jul 14, 2019 12:33 am

Hi,

I am trying to parallelise over experiments, each of which consists of a set of trials (I am not parallelising over trials). Because of how the library I am using works, I can't follow the MPGA example exactly: the library requires TBB to parallelise over experiments.

I have attached the code that runs before the experiments start (start_code) and the code that runs during a single experiment (experiment_code).

The issue I am seeing is that the code works fine when I set the random seed to 0. For other seeds, the code hangs at some point before the first progress file is written. It possibly has to do with flushing. I was looking at the definition of CSimulator::Reset():

void CSimulator::Reset() {
   ...
   if(m_bWasRandomSeedSet) {
      ..
   }
   else {
      ...
      LOG << "[INFO] Using random seed = " << m_unRandomSeed << std::endl;
   }
   ...
   /* Reset the loop functions */
   m_pcLoopFunctions->Reset();
   LOG.Flush();
   LOGERR.Flush();
}

I am just wondering whether flushing without writing anything could be problematic somehow. More generally, when do you need to flush, and is there any possibility that it can hang the master?

I also saw in mpga_example.cpp l.85-88:

/* The master sleeps to give enough time to the slaves to
 * initialize and suspend properly. If not enough time is given
 * here, the master will hang later on. */
::sleep(3);

Unfortunately, because my code creates forks continuously rather than only once at start-up, I cannot use sleep(): the master would sleep after every experiment.

Could you suggest some ways to fix this issue?
Attachments
experiment_code.txt (1.97 KiB)
start_code.txt (1.18 KiB)

pincy
Site Admin
Posts: 632
Joined: Thu Mar 08, 2012 8:04 pm
Location: Boston, MA

Re: Hanging issue with parallelisation

Postby pincy » Sun Jul 14, 2019 3:22 am

I don't know tbb - are you referring to https://www.threadingbuildingblocks.org/ ?

If so, consider that ARGoS is built around the Singleton pattern. This means that the only way to run parallel ARGoS instances is to use multiple processes, not threads. You can follow the MPGA example or use libraries like MPI, but you can't use libraries that handle multiple threads.

Regarding flushing: the ARGoS log is multi-threaded. When you run ARGoS with multiple threads, the log manager automatically creates a log instance for each thread. Each instance buffers its content until a LOG.Flush() (or LOGERR.Flush()) is called. When this happens, the contents of the log instances are printed to stdout (or stderr) in sequence, so that no overlap occurs.
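
The buffering behaviour described above can be illustrated with a minimal sketch in standard C++. This is not the ARGoS implementation: BufferedLog, Write(), Flush() and Demo() are hypothetical names, and the sketch returns the flushed text as a string instead of printing it. It only shows the idea of one buffer per thread, emitted in sequence on flush, and that flushing an empty log is harmless.

```cpp
#include <map>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical stand-in for a multi-threaded log: one buffer per thread,
// emitted only when Flush() is called.
class BufferedLog {
public:
   // Append text to the calling thread's private buffer.
   void Write(const std::string& str_text) {
      std::lock_guard<std::mutex> cLock(m_cMutex);
      m_mapBuffers[std::this_thread::get_id()] << str_text;
   }
   // Concatenate the buffers one after the other, then clear them.
   // Flushing with nothing buffered simply returns an empty string.
   std::string Flush() {
      std::lock_guard<std::mutex> cLock(m_cMutex);
      std::string strOut;
      for(auto& cEntry : m_mapBuffers) {
         strOut += cEntry.second.str();
      }
      m_mapBuffers.clear();
      return strOut;
   }
private:
   std::mutex m_cMutex;
   std::map<std::thread::id, std::ostringstream> m_mapBuffers;
};

// Two threads log two fragments each; a single flush emits them per thread,
// so fragments from different threads are never interleaved.
std::string Demo() {
   BufferedLog cLog;
   std::thread cA([&cLog] { cLog.Write("[A1]"); cLog.Write("[A2]"); });
   std::thread cB([&cLog] { cLog.Write("[B1]"); cLog.Write("[B2]"); });
   cA.join();
   cB.join();
   return cLog.Flush();
}
```

Which thread's buffer comes first depends on the ordering of the thread ids, so Demo() may return the A fragments before the B fragments or the other way round, but never a mix of the two.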
I made ARGoS.

Postby dmb » Sun Jul 14, 2019 1:32 pm

Perhaps my description was not as precise as it should have been.

If you have a look at experiment_code.txt, you can see that there is a fork() which creates a child process to run the simulator. So we are using multiple processes as in the example, just in a slightly different way.

Each thread runs a different instance of the code in experiment_code.txt, and inside each instance a child process is created.

Postby pincy » Sun Jul 14, 2019 3:25 pm

The code itself does not say much. It looks correct: it creates a memory-mapped file and forks. It's impossible to guess the problem from those code snippets. It would help to know where the code hangs. You could use gdb to attach to a running instance of ARGoS and check what's going on.
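
The pattern mentioned above (a shared mapping plus fork()) can be sketched roughly as follows. This is not the code from the attachments: it assumes an anonymous shared mapping instead of a memory-mapped file, and SResult, RunTrial() and RunInChild() are hypothetical names. RunTrial() stands in for whatever the child does with the simulator.

```cpp
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical result record shared between parent and child.
struct SResult {
   double Fitness;
   int Done;
};

// Stand-in for one experiment; a real child would run the simulator here.
static double RunTrial(int n_seed) {
   return 0.5 * n_seed;
}

// Fork a child, let it write its result into shared memory, and copy the
// result into f_fitness. Returns false if mmap(), fork() or the child failed.
bool RunInChild(int n_seed, double& f_fitness) {
   void* pt_mem = ::mmap(nullptr, sizeof(SResult), PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
   if(pt_mem == MAP_FAILED) return false;
   SResult* ps_result = static_cast<SResult*>(pt_mem);
   ps_result->Fitness = 0.0;
   ps_result->Done = 0;
   bool b_ok = false;
   pid_t t_pid = ::fork();
   if(t_pid == 0) {
      /* Child: compute, publish the result, and leave immediately */
      ps_result->Fitness = RunTrial(n_seed);
      ps_result->Done = 1;
      ::_exit(0);
   }
   else if(t_pid > 0) {
      /* Parent: block until the child terminates, then read the result */
      int n_status = 0;
      if(::waitpid(t_pid, &n_status, 0) == t_pid &&
         WIFEXITED(n_status) && WEXITSTATUS(n_status) == 0 &&
         ps_result->Done == 1) {
         f_fitness = ps_result->Fitness;
         b_ok = true;
      }
   }
   ::munmap(pt_mem, sizeof(SResult));
   return b_ok;
}
```

The Done flag lets the parent tell a child that finished its work apart from one that exited before writing anything.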

I could be wrong, but I don't think flushing the log has anything to do with ARGoS hanging. You can flush as many times as you want, even when there's nothing to write, and that won't be a problem. If you redirect the log correctly, as you seem to do in start_code.txt, you should get one file per process with the log output, as in MPGA.

I would probably refrain from using waitid() and instead use waitpid(), at least in a first implementation. This would allow you to capture all the state changes of the child processes, including cases in which a stop signal is delivered for some reason.
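
As an illustration of the waitpid() suggestion, here is a minimal sketch that observes both a stop and an exit of a child process. It assumes the WUNTRACED option is what reports the stop; DescribeStatus() and ObserveChild() are hypothetical helper names, and the child simply stops itself with SIGSTOP and then exits with code 7.

```cpp
#include <csignal>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Translate a waitpid() status word into a readable description.
std::string DescribeStatus(int n_status) {
   if(WIFEXITED(n_status))
      return "exited with code " + std::to_string(WEXITSTATUS(n_status));
   if(WIFSIGNALED(n_status))
      return "killed by signal " + std::to_string(WTERMSIG(n_status));
   if(WIFSTOPPED(n_status))
      return "stopped by signal " + std::to_string(WSTOPSIG(n_status));
   return "unknown state change";
}

// Fork a child that stops itself and then exits with code 7, and report
// both state changes as seen by the parent.
std::string ObserveChild() {
   pid_t t_pid = ::fork();
   if(t_pid < 0) return "fork failed";
   if(t_pid == 0) {
      ::raise(SIGSTOP);   /* Child suspends itself until it gets SIGCONT */
      ::_exit(7);
   }
   std::string str_log;
   int n_status = 0;
   /* WUNTRACED makes waitpid() return when the child stops,
    * not only when it terminates */
   if(::waitpid(t_pid, &n_status, WUNTRACED) != t_pid) return "waitpid failed";
   str_log += WIFSTOPPED(n_status) ? "stopped; " : "not stopped; ";
   /* Resume the child and wait for the next state change */
   ::kill(t_pid, SIGCONT);
   if(::waitpid(t_pid, &n_status, WUNTRACED) != t_pid) return "waitpid failed";
   str_log += DescribeStatus(n_status);
   return str_log;
}
```

Because the parent only sends SIGCONT after it has seen the stop, there is no window in which the resume signal could arrive before the child has suspended.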

Postby dmb » Sun Jul 14, 2019 3:39 pm

In this case, there is only one log file, and it is associated with the parent process. Since new forks are being created and destroyed all the time, redirecting from the children would create many different files.

Postby pincy » Sun Jul 14, 2019 5:17 pm

I don't understand what you mean. In theory, every ARGoS instance should be associated with two log files, one for stdout and one for stderr. Those files should be process-specific, as in MPGA.

Postby dmb » Sun Jul 14, 2019 5:35 pm

The code in experiment_code.txt runs continuously: it creates a new fork for each experiment and destroys it once the variables are set and shared with the master. With forks being created and destroyed all the time, process-specific files are not a practical approach, because thousands of files would be created.

The code in experiment_code.txt has to run in threads; the only problem is that ARGoS is a singleton, so we cannot copy it. If we were able to make copies of the CSimulator, we wouldn't need to use any forks at all. Could you perhaps suggest a way to clone a CSimulator?

Postby pincy » Sun Jul 14, 2019 6:19 pm

I have not designed ARGoS with this kind of use case in mind, so I don't know how I would clone ARGoS instances, sorry. The MPGA example is pretty much the correct way to handle multiple instances of ARGoS quickly: you create a pool of instances, use Reset() to configure them, and launch the instances you need.
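
The pool idea can be sketched in plain POSIX C++ as follows. This is a sketch under stated assumptions, not the MPGA code: the workers suspend themselves with SIGSTOP, the master waits for each suspension with waitpid() and WUNTRACED rather than sleeping, and SSlot, WorkerLoop(), WaitUntilStopped() and RunPool() are hypothetical names. Squaring an integer stands in for resetting and running a simulator instance.

```cpp
#include <algorithm>
#include <csignal>
#include <cstddef>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Hypothetical slot shared between the master and one worker.
struct SSlot {
   int Input;
   int Output;
};

// Worker: suspend; when resumed, process the slot and suspend again.
static void WorkerLoop(SSlot* ps_slot) {
   while(true) {
      ::raise(SIGSTOP);   /* Wait for the master's SIGCONT */
      ps_slot->Output = ps_slot->Input * ps_slot->Input;   /* Stand-in for the real work */
   }
}

// Block until the worker is suspended; false if it terminated instead.
static bool WaitUntilStopped(pid_t t_pid) {
   int n_status = 0;
   return ::waitpid(t_pid, &n_status, WUNTRACED) == t_pid && WIFSTOPPED(n_status);
}

// Create the workers once, reuse them for every input, then shut them down.
std::vector<int> RunPool(const std::vector<int>& vec_inputs, std::size_t un_workers) {
   std::vector<int> vec_outputs;
   if(un_workers == 0) return vec_outputs;
   void* pt_mem = ::mmap(nullptr, un_workers * sizeof(SSlot), PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
   if(pt_mem == MAP_FAILED) return vec_outputs;
   SSlot* ps_slots = static_cast<SSlot*>(pt_mem);
   std::vector<pid_t> vec_pids;
   for(std::size_t i = 0; i < un_workers; ++i) {
      pid_t t_pid = ::fork();
      if(t_pid == 0) { WorkerLoop(&ps_slots[i]); ::_exit(0); }
      if(t_pid < 0) break;   /* Keep slot i matched with vec_pids[i] */
      vec_pids.push_back(t_pid);
   }
   /* Wait until every worker has actually suspended itself */
   for(pid_t t_pid : vec_pids) WaitUntilStopped(t_pid);
   /* Hand out the inputs in batches of at most one per worker */
   for(std::size_t un_first = 0;
       !vec_pids.empty() && un_first < vec_inputs.size();
       un_first += vec_pids.size()) {
      std::size_t un_batch = std::min(vec_pids.size(), vec_inputs.size() - un_first);
      for(std::size_t i = 0; i < un_batch; ++i) {
         ps_slots[i].Input = vec_inputs[un_first + i];
         ::kill(vec_pids[i], SIGCONT);
      }
      for(std::size_t i = 0; i < un_batch; ++i) {
         if(WaitUntilStopped(vec_pids[i])) vec_outputs.push_back(ps_slots[i].Output);
      }
   }
   /* Shut the pool down and reap the workers */
   for(pid_t t_pid : vec_pids) {
      ::kill(t_pid, SIGKILL);
      ::waitpid(t_pid, nullptr, 0);
   }
   ::munmap(pt_mem, un_workers * sizeof(SSlot));
   return vec_outputs;
}
```

The workers are forked once and reused for every batch, so the cost of initialising a worker is paid only at start-up.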

I don't know where the approach you're following is coming from (i.e., what your constraints are), but I don't think going down that path will be efficient even if it ends up working. ARGoS is designed to take a little while to initialize, but be fast during the execution. Therefore, creating and destroying instances constantly is a waste of time.

If you want to avoid the log problem completely, just redirect all the logs to /dev/null.
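
One way to do this from inside a freshly forked child is sketched below. It assumes the log ends up on the standard output and error descriptors of the process and has not been redirected elsewhere; SilenceStandardStreams() and RunSilenced() are hypothetical helper names.

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Point stdout and stderr of the calling process at /dev/null.
bool SilenceStandardStreams() {
   int n_fd = ::open("/dev/null", O_WRONLY);
   if(n_fd < 0) return false;
   /* Push out anything already buffered before swapping descriptors */
   std::fflush(stdout);
   std::fflush(stderr);
   bool b_ok = ::dup2(n_fd, STDOUT_FILENO) >= 0 &&
               ::dup2(n_fd, STDERR_FILENO) >= 0;
   if(n_fd > STDERR_FILENO) ::close(n_fd);
   return b_ok;
}

// Fork a child, silence its output, run the task in it, and return the
// task's exit code (or -1 if the child could not be run or reaped).
int RunSilenced(int (*pf_task)()) {
   /* Flush in the parent so the child does not inherit pending output */
   std::fflush(stdout);
   std::fflush(stderr);
   pid_t t_pid = ::fork();
   if(t_pid < 0) return -1;
   if(t_pid == 0) {
      if(!SilenceStandardStreams()) ::_exit(127);
      int n_result = pf_task();
      std::fflush(stdout);
      ::_exit(n_result);
   }
   int n_status = 0;
   if(::waitpid(t_pid, &n_status, 0) != t_pid || !WIFEXITED(n_status)) return -1;
   return WEXITSTATUS(n_status);
}
```

The parent's own stdout and stderr are untouched, so only the children are silenced and no per-process log files are created.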

