[CLOSED] Virtual Coach - Simulation responded with HTTP status 504


#1

Hey Guys,

I am trying to use the VC to run multiple experiments one after another, and there are two occasions on which I get an HTTP 504 error.

The first is presumably a bug and occurs when I call vc._VirtualCoach__get_available_files() while the csv_records folder has many items in it.

Output is:

[...]
Input In [8], in computeIterationServer_singleThread(inputs, CAMfunction, blacklist, instanceLimit, sleeptime, maxTries, osimPath)
    144 lastData = None
    146 print(f'STEP 5.1: Get CSV file list...')
--> 147 csv_files = vc._VirtualCoach__get_available_files(experiment_id=params["exp_id"],file_type="csv")
    148 keys = natsort.natsorted(csv_files.keys())[-requestedInstances:] #the last N names
    150 print(f'STEP 5.2: Read CSV files...')

File ~/nrp/src/VirtualCoach/hbp_nrp_virtual_coach/pynrp/virtual_coach.py:624, in VirtualCoach.__get_available_files(self, experiment_id, file_type)
    620 response = requests.get(self.__config['proxy-services']['files'] % \
    621                         (experiment_id, file_type,), headers=self.__http_headers)
    623 if response.status_code != http.client.OK:
--> 624     raise Exception('Error when getting files Status Code: %d. Error: %s'
    625                     % (response.status_code, response))
    626 files = json.loads(response.content)
    627 distinct_runs = defaultdict(dict)

Exception: Error when getting files Status Code: 504. Error: <Response [504]> 

The error occurs reliably/reproducibly when the csv_records folder is full, and it disappears after emptying the folder manually through the web interface.
As my script/approach is based on many short runs with different parameters, I can't really avoid generating many CSV files. Emptying the folder manually is rather impractical (especially when running the script overnight).

As a workaround, is there a way to delete files from the VC? I only found functions to read files. In the long run, however, it would be nice for this error not to occur at all, or to have some possibility for temporary CSV output; I really only need the CSV output once after the run. Am I overlooking some existing solution for this case? Do you have hints/ideas on that?

The second situation in which the HTTP 504 occurs is when I call vc.launch_experiment() multiple times.
To explain when this happens, I will dive a bit into the actual script, as the error does not occur right at the start but after some time.

I do multiple iterations of the following steps:

  1. (Re)create the VC instance via vc = VirtualCoach(...)
  2. Launch N experiments on N different available servers, one after the other, via sim = vc.launch_experiment(...)
  • vc.launch_experiment() blocks for ca. 25-45 seconds. As I use a for loop, these calls are not made in rapid succession.
  • I save the resulting sim objects in a list for later use
  3. For each experiment, change the individual simulation parameters in the transfer functions with sim.get_transfer_function(...) and sim.edit_transfer_function(...)
  4. Start all simulations (sim.start()) using a for loop
  5. Check the state of each simulation (sim.get_state()) until they are all finished, then stop them (sim.stop())
  6. (Re)create the VC instance via vc = VirtualCoach(...)
  7. Get the list of CSV files (vc._VirtualCoach__get_available_files(...)) and read the last N files
  8. Some task-specific computations on the CSV data
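Since these launch failures are transient, one pragmatic mitigation would be a simple retry helper around each launch in the list above. The following is only a sketch; the launch callable and the retry parameters are illustrative choices, not pynrp API:

```python
import time

def launch_with_retry(launch, max_tries=5, backoff_s=10.0):
    """Call a zero-argument launch callable, retrying on failure.

    `launch` would wrap something like
    lambda: vc.launch_experiment(params["exp_id"], server=server);
    any exception (such as the HTTP 504 above) triggers a retry
    after `backoff_s` seconds.
    """
    last_exc = None
    for attempt in range(1, max_tries + 1):
        try:
            return launch()
        except Exception as exc:  # pynrp raises a plain Exception on failure
            last_exc = exc
            if attempt < max_tries:
                time.sleep(backoff_s)
    raise RuntimeError('launch failed after %d tries' % max_tries) from last_exc
```

Each sim returned this way could then be collected in the list as before.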

In most cases, one of these iterations runs just fine (if the csv_records folder was emptied beforehand); often several more do as well. However, after some time (e.g. somewhere between 3 and 30 runs), sim = vc.launch_experiment(...) in step 2 throws an HTTP 504 error.

The logging output before the Exception is:

[...]
INFO: [2022-08-30 21:44:38,625 - VirtualCoach] Retrieving list of available servers.
INFO: [2022-08-30 21:44:38,699 - VirtualCoach] Checking server availability for: prod_latest_backend-7-80
INFO: [2022-08-30 21:44:38,700 - Simulation] Attempting to launch nrpexperiment_robobrain_mouse_0_0_0_0_0 on prod_latest_backend-7-80.
ERROR: [2022-08-30 21:45:28,798 - Simulation] Unable to launch on prod_latest_backend-7-80: Simulation responded with HTTP status 504
Traceback (most recent call last):
  File "/home/bbpnrsoa/nrp/src/VirtualCoach/hbp_nrp_virtual_coach/pynrp/simulation.py", line 158, in launch
    raise Exception(
Exception: Simulation responded with HTTP status 504

The exception output:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
[...]

Input In [8], in launchSequential(vc, params, requestedInstances, availableServers, sleeptime)
     94 print(f'[{i}]: Try to launch {params["exp_id"]} on {server}...', end = ' ')
     95 try:
     96     #vc = VirtualCoach(environment='http://148.187.149.212', oidc_username='ge69yed', oidc_password="*******")
     97     #time.sleep(10)
     98     #availableServers=getAvailableServer(vc,blacklist)
     99     #server=availableServers[0]
--> 101     sim = vc.launch_experiment(params["exp_id"],server=server) #[NRP issue]: somthimes this fails.
    102     sims.append(sim)
    103     success = True

File ~/nrp/src/VirtualCoach/hbp_nrp_virtual_coach/pynrp/virtual_coach.py:289, in VirtualCoach.launch_experiment(self, experiment_id, server, reservation, cloned, brain_processes, profiler, recordingPath)
    286         logger.info(e)
    288 # simulation launch unsuccessful, abort
--> 289 raise Exception('Simulation launch failed, consult the logs or try again later.')

Exception: Simulation launch failed, consult the logs or try again later.

The logging output afterwards:

INFO: [2022-08-30 21:47:59,238 - roslibpy] WebSocket connection closed: Code=1006, Reason=connection was closed uncleanly ("peer dropped the TCP connection without previous WebSocket closing handshake")
INFO: [2022-08-30 21:47:59,239 - twisted] <twisted.internet.tcp.Connector instance at 0x7f61250e4370 disconnected IPv4Address(type='TCP', host='148.187.149.212', port=80)> will retry in 2 seconds
INFO: [2022-08-30 21:47:59,239 - twisted] Stopping factory <roslibpy.comm.comm_autobahn.AutobahnRosBridgeClientFactory object at 0x7f612007ec10>
INFO: [2022-08-30 21:48:02,090 - twisted] Starting factory <roslibpy.comm.comm_autobahn.AutobahnRosBridgeClientFactory object at 0x7f612007ec10>
INFO: [2022-08-30 21:48:02,221 - roslibpy] Connection to ROS MASTER ready.

[a bunch of times the above 5 lines]

INFO: [2022-08-30 21:58:05,099 - roslibpy] WebSocket connection closed: Code=1006, Reason=connection was closed uncleanly ("peer dropped the TCP connection without previous WebSocket closing handshake")
INFO: [2022-08-30 21:58:05,099 - twisted] <twisted.internet.tcp.Connector instance at 0x7f61250e4370 disconnected IPv4Address(type='TCP', host='148.187.149.212', port=80)> will retry in 2 seconds
INFO: [2022-08-30 21:58:05,100 - twisted] Stopping factory <roslibpy.comm.comm_autobahn.AutobahnRosBridgeClientFactory object at 0x7f612007ec10>
INFO: [2022-08-30 21:58:07,117 - twisted] Starting factory <roslibpy.comm.comm_autobahn.AutobahnRosBridgeClientFactory object at 0x7f612007ec10>
INFO: [2022-08-30 21:58:07,279 - twisted] failing WebSocket opening handshake ('WebSocket connection upgrade failed [401]: Unauthorized')
WARNING: [2022-08-30 21:58:07,280 - twisted] dropping connection to peer tcp4:148.187.149.212:80 with abort=True: WebSocket connection upgrade failed [401]: Unauthorized

[repeating the above 6 lines forever]

If you want/need I can also provide the actual source code.

Some context:

  • I run the Virtual Coach in the latest NRP Docker image
  • Leaving out the last step (the task-specific computations on the CSV data) did not change anything
  • Putting sleep statements of between 0 and 30 s after each step did not help either
  • This error occurs regardless of how big N is (the number of “parallel” simulations); oddly, it works only for N=2 or N=3, for some reason
  • The N servers I use are normally the same, so the first simulation of iteration 1 normally runs on the same server as the first simulation of iteration 2
  • I did not see a pattern of servers which are prone to causing this error; every time it is a different one, and none seem unaffected

The reason why I do not simply run one full simulation after another (which would also allow using get_experiment_run_file for the CSV) is that I intend to replace the for loop of blocking launches in step 2 with actually (thread-)parallel launches. Real parallel utilisation of multiple servers would greatly speed up my approach. However, calling vc.launch_experiment() in only two (N=2) parallel threads (with a delay of 2 s) on the same vc object is practically guaranteed to produce an HTTP 504 error. The output before the exception looks like this:

[...]
[t=0]: Try to launch nrpexperiment_robobrain_mouse_0_0_0_0_0 on prod_latest_backend-7-80... 

INFO: [2022-08-31 12:30:55,367 - VirtualCoach] Preparing to launch nrpexperiment_robobrain_mouse_0_0_0_0_0.
INFO: [2022-08-31 12:30:55,368 - VirtualCoach] Retrieving list of experiments.

[t=1]: Try to launch nrpexperiment_robobrain_mouse_0_0_0_0_0 on prod_latest_backend-7-443... 

INFO: [2022-08-31 12:30:57,368 - VirtualCoach] Preparing to launch nrpexperiment_robobrain_mouse_0_0_0_0_0.
INFO: [2022-08-31 12:30:57,369 - VirtualCoach] Retrieving list of experiments.
INFO: [2022-08-31 12:31:04,642 - VirtualCoach] Retrieving list of available servers.
INFO: [2022-08-31 12:31:04,768 - VirtualCoach] Checking server availability for: prod_latest_backend-7-80
INFO: [2022-08-31 12:31:04,769 - Simulation] Attempting to launch nrpexperiment_robobrain_mouse_0_0_0_0_0 on prod_latest_backend-7-80.
INFO: [2022-08-31 12:31:06,045 - VirtualCoach] Retrieving list of available servers.
INFO: [2022-08-31 12:31:06,097 - VirtualCoach] Checking server availability for: prod_latest_backend-7-443
INFO: [2022-08-31 12:31:06,098 - Simulation] Attempting to launch nrpexperiment_robobrain_mouse_0_0_0_0_0 on prod_latest_backend-7-443.
ERROR: [2022-08-31 12:31:54,868 - Simulation] Unable to launch on prod_latest_backend-7-80: Simulation responded with HTTP status 504
Traceback (most recent call last):
  File "/home/bbpnrsoa/nrp/src/VirtualCoach/hbp_nrp_virtual_coach/pynrp/simulation.py", line 158, in launch
    raise Exception(
Exception: Simulation responded with HTTP status 504
ERROR: [2022-08-31 12:31:56,199 - Simulation] Unable to launch on prod_latest_backend-7-443: Simulation responded with HTTP status 504
Traceback (most recent call last):
  File "/home/bbpnrsoa/nrp/src/VirtualCoach/hbp_nrp_virtual_coach/pynrp/simulation.py", line 158, in launch
    raise Exception(
Exception: Simulation responded with HTTP status 504

The actual exception message looks like this:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "<ipython-input-8-eedb071500b0>", line 38, in launchServer
    raise e
  File "<ipython-input-8-eedb071500b0>", line 31, in launchServer
    sim = vc.launch_experiment(params["exp_id"],server=server) #[NRP issue]: somthimes this fails.
  File "/home/bbpnrsoa/nrp/src/VirtualCoach/hbp_nrp_virtual_coach/pynrp/virtual_coach.py", line 289, in launch_experiment
    raise Exception('Simulation launch failed, consult the logs or try again later.')
Exception: Simulation launch failed, consult the logs or try again later.
"""

The above exception was the direct cause of the following exception:

Exception                                 Traceback (most recent call last)

[...]

Input In [8], in launchParalell(vc, params, requestedInstances, availableServers, sleeptime)
     64 #RUN threads
     65 with Pool(requestedInstances) as p:
---> 66     runOutputs = p.map(launchServer, args)
     68 #only use the runs where the server started sucessfully
     69 sims = []

File /usr/lib/python3.8/multiprocessing/pool.py:364, in Pool.map(self, func, iterable, chunksize)
    359 def map(self, func, iterable, chunksize=None):
    360     '''
    361     Apply `func` to each element in `iterable`, collecting the results
    362     in a list that is returned.
    363     '''
--> 364     return self._map_async(func, iterable, mapstar, chunksize).get()

File /usr/lib/python3.8/multiprocessing/pool.py:771, in ApplyResult.get(self, timeout)
    769     return self._value
    770 else:
--> 771     raise self._value

Exception: Simulation launch failed, consult the logs or try again later.

I do not know whether the parallel-thread 504 error and the serial for-loop launch 504 error are related.

It would be great if some of you had ideas on how to fix or circumvent this 504 error. Are there maybe other projects which (successfully) use multiple servers (in parallel) in succession?

Thanks in advance,
Felix


#2

Hi Felix,
My first guess is that, because you create a new VC instance in each iteration, you might be trying to read the CSV files from a previous VC. And as CSV file names are encoded with the date and time, after a minute, newly generated files have a different name.
Likewise, when you call launch_experiment() multiple times with different VC instances, check whether the 504 error occurs exactly upon a minute change, which would point to a similar issue.
For your sequential use case, I would suggest instantiating the VC only once and using that one vc to launch all experiments. See the example below.
For your parallel use case, make sure you keep pointers to your vc instances, so that you call launch and get_files on the correct instance and avoid mix-ups.
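A rough sketch of that per-thread bookkeeping could look as follows; make_vc, the worker, and the returned (vc, sim) pairs are illustrative assumptions for keeping the pointers together, not pynrp API:

```python
from concurrent.futures import ThreadPoolExecutor

def launch_parallel(make_vc, exp_id, servers):
    """Launch exp_id on each server with one VirtualCoach instance per thread.

    make_vc is a hypothetical factory, e.g.
    lambda: VirtualCoach(environment=..., storage_username=...),
    so that threads never share a vc, and each (vc, sim) pair stays
    together for later get-files calls on the correct instance.
    """
    def worker(server):
        vc = make_vc()                     # fresh instance for this thread
        sim = vc.launch_experiment(exp_id, server=server)
        return vc, sim                     # keep the pointers paired

    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        return list(pool.map(worker, servers))
```

pool.map preserves the order of servers, so the i-th pair belongs to the i-th server.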

I would suggest you sort out the sequential use case first.
(There is more than needed in the following code, and it is not runnable; I just pasted it for your inspiration.)

vc = VirtualCoach(environment='http://172.19.0.2:9000', storage_username='nrpuser', storage_password='*******')
exp_id='my_experiment'

dir_path=os.environ['HBP']+'my_path'

def run_episode(record=False): 

    # Launch the experiment when a backend is available
    while (len(vc._VirtualCoach__get_available_server_list())==0):
        rospy.logwarn("Waiting for available backends")
        time.sleep(0.5)
        
    while True:
        try:         
            sim=vc.launch_experiment(exp_id)
            break
        except Exception as e:
            rospy.logwarn("Exception: "+ str(e))

    # Start the simulation
    if (sim.get_state() == 'paused'):
        if (record): sim.start_recording()
        sim.start()
    
    # Wait until the end of the simulation
    # Evaluate thread implementation
    while (sim.get_state() == 'started'):
        time.sleep(0.5)  # poll, but avoid a busy-wait hammering the server

def set_training_mode():
    with open(dir_path + '.test_flag.txt','w') as f:
        f.write('F')

def set_test_mode():
    with open(dir_path + '.test_flag.txt','w') as f:
        f.write('T')

def set_random_mode():
    with open(dir_path + '.random_flag.txt','w') as f:
        f.write('T')

def unset_random_mode():
    with open(dir_path + '.random_flag.txt','w') as f:
        f.write('F')

def run_eval_episode(record=False):
    set_test_mode()
    run_episode(record=record)
    set_training_mode()
    R=read_reward_csv()
    return R

def reset_reward_csv():
    csv_path=os.environ['HBP']+'my_path'
    with open(csv_path, 'w') as f:
        _writer = csv.DictWriter(f, fieldnames=['return'])
        _writer.writeheader()
        #_writer.writerow({'return':self.cumulative_reward})

def append_reward_csv():
    csv_path=os.environ['HBP']+'my_path'+'test_returns.csv'
    with open(csv_path, 'a') as f:
        _writer = csv.DictWriter(f, fieldnames=['return'])
        #_writer.writeheader()
        _writer.writerow({'return':self.cumulative_reward})

def read_reward_csv():
    R=None
    csv_path=os.environ['HBP']+'my_path'+'test_returns.csv'
    with open(csv_path, newline='') as f:
        R = float(f.readlines()[-1])  # last recorded return
    return R

for ep in range(total_episodes):
    rospy.loginfo("################# Episode number " + str(ep+1) + " ##############################")
    run_episode()
    episode_idx+=1
    t+=max_ep_len
    incorporate_episodic_RB()

    # Evaluate Policy
    if (ep % eval_every == 0):
        rospy.loginfo("############ Evaluation - episode number " + str(ep+1) + " ##########################")
        record = True if (ep % record_every == 0) else False
        R=run_eval_episode(record=record) 

        <... etc ....>

#3

Concerning CSV files:

When I call vc._VirtualCoach__get_available_files() I indeed get the file names (and other metadata) for all CSV files generated by this experiment (by the way, the CSV files are generated by a transfer function using the CSV recorder, @nrp.MapCSVRecorder). However, I only read (via vc._VirtualCoach__get_file_content(exp_id=params["exp_id"], file_uuid=csvFile["uuid"])) the content of the last N files generated in the current iteration/epoch. It seems to me these files are stored under names with seconds precision. Especially in the sequential case, this should not produce naming conflicts.

Here is a screenshot from the online file explorer:

[screenshot omitted]

And what I get back from vc._VirtualCoach__get_available_files(experiment_id=params["exp_id"], file_type="csv") also looks like it has seconds precision:

defaultdict(dict,
            {'2022-08-29_18-18-10': {'data.csv': {'folder': '2022-08-29_18-18-10',
               'name': 'data.csv',
               'uuid': 'nrpexperiment_robobrain_mouse_0_0_0_0_0%2Fcsv_records%2Fcsv_records_2022-08-29_18-18-10%2Fdata.csv',
               'size': 1202126,
               'type': 'file',
               'modifiedOn': '2022-08-29T16:19:34.877Z'}},
             '2022-08-29_18-18-56': {'data.csv': {'folder': '2022-08-29_18-18-56',
               'name': 'data.csv',
               'uuid': 'nrpexperiment_robobrain_mouse_0_0_0_0_0%2Fcsv_records%2Fcsv_records_2022-08-29_18-18-56%2Fdata.csv',
               'size': 1204903,
               'type': 'file',
               'modifiedOn': '2022-08-29T16:19:32.717Z'}},
             '2022-08-29_18-21-19': {'data.csv': {'folder': '2022-08-29_18-21-19',
               'name': 'data.csv',
               'uuid': 'nrpexperiment_robobrain_mouse_0_0_0_0_0%2Fcsv_records%2Fcsv_records_2022-08-29_18-21-19%2Fdata.csv',
               'size': 1199491,
               'type': 'file',
               'modifiedOn': '2022-08-29T16:22:33.787Z'}},
[...] }
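For what it is worth, picking the last N runs out of a structure like the above only needs the folder keys; since the names are zero-padded (YYYY-MM-DD_HH-MM-SS), a plain lexicographic sort is already chronological, so natsort would give the same order here. A minimal sketch:

```python
def last_n_runs(csv_files, n):
    """Return the n most recent run-folder names from the mapping
    returned by __get_available_files (keys like '2022-08-29_18-18-10')."""
    return sorted(csv_files)[-n:]
```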

So I am not quite certain what you mean by that:

I think it would be easier to just use vc.get_last_run_file(), so the VC does not need to provide me with the (potentially very large) list of existing CSV files; instead I would get the most recent file directly. However, because I run N simulations before I get to the CSV-reading step, I would only have access to the last one of these N simulation files. That is why I use _VirtualCoach__get_available_files in the first place. Are there any options that are neither get-all-file-names nor get-only-the-last-file?

Concerning 504 error on sequential runs:

I was not sure which timing I should look at so I looked at the different launches indicated by the logger output lines saying

[< timestamp > - Simulation] Attempting to launch < experiment > on < server >.

When I try to run 24 simulations, it does not even finish the first iteration (X-axis: launch number, Y-axis: time since the last launch (attempt)):

[plot omitted]

Where the lines end, the error occurred. I added a sleep of 10 s between launches for the second run (orange) to see whether long startup times are the issue. However, the simulation crashes in the first experiment while all delays are under one minute, and in the second experiment it does not crash even when delays are longer than the crashing delay from experiment 1.

I made a similar experiment for N=12. In this case, two full iterations completed (blue & orange) before it crashed at the end of the third one (yellow):

[plot omitted]

I don’t really see a correlation between the crash moment and the delays between launches.

On the other hand, what I notice for the N=24 case is that the first error occurred 15:06 (min:s) after the first simulation launch, and the crash of the second experiment happened 14:55 after the first simulation launched. Is 15 minutes a timeout/constant for something?
Unfortunately, this does not explain the N=12 case, where the iterations finish within 15 min anyway.

Using one VC instance for everything did not help. The 15-minute threshold appears there as well; it is the only constant I currently observe.

If you have ideas on what to look at next, I will gladly try them out.


#4

Hi,

Because you mention crash times that are close to 15 min, I am thinking that this is the default timeout for simulations. Could it be that one simulation eventually times out and for some reason crashes the Gazebo backend as well? Can you have a look at the logs of the particular instance that crashed?

General context question first: are you launching your simulations from remote servers (the official NRP servers), Docker instances on your local computer, or some other setup?

Best


#5

It depends on what exactly the timeout is applied to. When I launch N=24 simulations one after another, it indeed takes more than 15 minutes. However, the simulations are not “running”: starting the simulations happens nearly simultaneously, and all simulations finish in under 1 minute. So if there is a timeout on how long a launched idle simulation may exist, that is certainly exceeded. But actual simulation-time and actual computation-time thresholds should not be exceeded, in my opinion.

I have the latest NRP Docker image running on my local PC. Inside this image I have the VC (pynrp) installed. From this Docker container I start instances on the official online NRP servers. I therefore have no access to the actual server logs.

If you wonder why I do not simply run the VC directly on my PC: because pynrp crashes on Ubuntu 22.04.1. Afaik, @v.vorobev is already looking into that (I sent him my crash logs yesterday).


#6

Ok. I’m not sure what’s happening there; we’ll try to reproduce your issue to figure it out.
Can you please attach your experiment files and your VC script as a zip file to the ticket below?
https://hbpneurorobotics.atlassian.net/browse/NUIT-296

Best


#7

Just a thought: in sequential mode, you could try running the simulations on your own container instead of the official servers and see if you get the same issue. You could then also read the logs.


#8

Small update for the sake of this thread’s completeness: the issue is indeed related to timeouts. There is a bug in the timeout when it is set to “simulation” time (https://hbpneurorobotics.atlassian.net/browse/NRRPLT-8629); setting it to “real” works around it. The other issue is being investigated by the developer team.


#9

On my local server the sequential simulation works without any issues. I could not reproduce any of the mentioned errors (504 and CSV) on the local Docker installation.


#10

Sorry it took so long.
I uploaded a Jupyter notebook with the code and my NRP experiment export to the Jira thread.


#11

FYI, the issue is still under investigation by the dev team.


#12

This issue has been closed with the following comment:
The problem that caused this issue is related to the bug with the timeout type (NRRPLT-8629).

There is still a problem related to the authentication timeout, but it is not specific to parallel runs; it is common to all remote simulations. As a workaround, the VirtualCoach should log in again before executing commands whenever two calls happen more than 15 minutes apart. A separate issue has been created: https://hbpneurorobotics.atlassian.net/browse/NRRPLT-8714
There is still a problem related to the authentication timeout, but it doesn’t relate specifically to the parallel runs, it’s common for all remote simulations. As a workaround, the VirtualCoach should log in every time before executing commands (in case two different calls happen in more than 15min time interval). There is a separate issue created https://hbpneurorobotics.atlassian.net/browse/NRRPLT-8714 - Request access