Backend breaks after one experiment


#1

Hey guys,

after I updated to the latest version of the NRP recently (~3 days ago, via ./update_nrp update all), the backend seems to break after one experiment. What I mean by that is the following:

  1. Run cle-frontend
  2. Run cle-start
  3. Start an experiment of your choice
  4. Close the experiment
  5. Try to restart the experiment
  6. Get this:

2018-01-29 16:21:41,250 [uWSGIWorker1] [hbp_nrp_back] [ERROR] 'Error while communicating with the CLE (timeout exceeded while waiting for service /ros_cle_simulation/create_new_simulation)' (CLE error)
Traceback (most recent call last):
  File "/home/ben/.opt/platform_venv/lib/python2.7/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/ben/.opt/platform_venv/lib/python2.7/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/ben/.opt/platform_venv/lib/python2.7/site-packages/flask_restful/__init__.py", line 480, in wrapper
    resp = resource(*args, **kwargs)
  File "/home/ben/.opt/platform_venv/lib/python2.7/site-packages/flask/views.py", line 84, in view
    return self.dispatch_request(*args, **kwargs)
  File "/home/ben/.opt/platform_venv/lib/python2.7/site-packages/flask_restful/__init__.py", line 595, in dispatch_request
    resp = meth(*args, **kwargs)
  File "./hbp_nrp_backend/rest_server/__SimulationService.py", line 176, in post
    sim.state = "initialized"
  File "./hbp_nrp_backend/simulation_control/__Simulation.py", line 237, in state
    self.__lifecycle.accept_command(new_state)
  File "/home/ben/Documents/NRP/ExDBackend/hbp_nrp_commons/hbp_nrp_commons/simulation_lifecycle.py", line 182, in accept_command
    self.__machine.events[command].trigger()
  File "/home/ben/.opt/platform_venv/lib/python2.7/site-packages/transitions/extensions/locking.py", line 22, in trigger
    super(LockedEvent, self).trigger(*args, **kwargs)
  File "/home/ben/.opt/platform_venv/lib/python2.7/site-packages/transitions/core.py", line 222, in trigger
    return self.machine.process(f)
  File "/home/ben/.opt/platform_venv/lib/python2.7/site-packages/transitions/extensions/locking.py", line 15, in __call__
    return self.func(*args, **kwargs)
  File "/home/ben/.opt/platform_venv/lib/python2.7/site-packages/transitions/core.py", line 526, in process
    return trigger()
  File "/home/ben/.opt/platform_venv/lib/python2.7/site-packages/transitions/core.py", line 247, in _trigger
    if t.execute(event):
  File "/home/ben/.opt/platform_venv/lib/python2.7/site-packages/transitions/core.py", line 145, in execute
    machine.callback(func, event_data)
  File "/home/ben/.opt/platform_venv/lib/python2.7/site-packages/transitions/extensions/locking.py", line 15, in __call__
    return self.func(*args, **kwargs)
  File "/home/ben/.opt/platform_venv/lib/python2.7/site-packages/transitions/core.py", line 516, in callback
    func(event_data)
  File "./hbp_nrp_backend/simulation_control/__BackendSimulationLifecycle.py", line 223, in initialize
    "CLE error")
NRPServicesGeneralException: 'Error while communicating with the CLE (timeout exceeded while waiting for service /ros_cle_simulation/create_new_simulation)' (CLE error)

Sometimes you don't get an exception and everything works fine, but the experiments now seem to be cached: any changes you make after an initial cle-start are disregarded in the next run of the experiment (if it does not break as above); only a cle-kill followed by cle-start gets your changes through.
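
For reference, that workaround as shell commands (nothing beyond the aliases already mentioned above):

cle-kill     # stop the backend completely
cle-start    # start it fresh; edited experiment files are read again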

Regards,
Ben

@camilo


#2

The first experiment that you ran crashed badly and left the simulation server beyond repair. Normally, the simulation server should recognize situations like that and unpublish the /ros_cle_simulation/create_new_simulation service. Could you double-check whether that service is still advertised? Just run rosservice list and post the output here.
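
A quick way to check exactly that service with standard ROS tooling, for example:

rosservice list | grep create_new_simulation

If grep prints nothing, the service has been unpublished.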

Apart from that, the error message is not really helpful: it basically just says that your simulation server does not respond. There must be a more telling error message further up in the simulation server's logs.


#3

Hey,

this is the output of rosservice list (after the second experiment fails to start):

/nrp_backend/get_loggers
/nrp_backend/set_logger_level
/ros_cle_simulation/is_simulation_running
/ros_cle_simulation/version
/ros_cle_simulation_9264_1517920862429/get_loggers
/ros_cle_simulation_9264_1517920862429/set_logger_level
/rosbridge_websocket/get_loggers
/rosbridge_websocket/set_logger_level
/rosout/get_loggers
/rosout/set_logger_level
/sm_ClosedLoop/get_loggers
/sm_ClosedLoop/set_logger_level
/web_video_server/get_loggers
/web_video_server/set_logger_level

I also checked for another error message, but there is none. The only thing that might be related is the following, which comes up on shutdown of the first experiment:

[hbp_nrp_cles] [INFO] Cleaning up ROS nodes and services
[WARN] [1517921091.066349]: Could not process inbound connection: [/ros_cle_simulation_9264_1517920862429] is not a publisher of [/ros_cle_simulation/status]. Topics are [['/ros_cle_simulation/logs', 'cle_ros_msgs/ClientLoggerMessage'], ['/rosout', 'rosgraph_msgs/Log'], ['/robot/arm_2_joint/cmd_pos', 'std_msgs/Float64']]{'message_definition': 'string data\n', 'callerid': '/rosbridge_websocket', 'tcp_nodelay': '0', 'md5sum': '992ce8a1687cec8c8bd883ec73ca41d1', 'topic': '/ros_cle_simulation/status', 'type': 'std_msgs/String'}
2018-02-06 13:44:51,066 [Thread-3 ] [rosout ] [WARNING] Could not process inbound connection: [/ros_cle_simulation_9264_1517920862429] is not a publisher of [/ros_cle_simulation/status]. Topics are [['/ros_cle_simulation/logs', 'cle_ros_msgs/ClientLoggerMessage'], ['/rosout', 'rosgraph_msgs/Log'], ['/robot/arm_2_joint/cmd_pos', 'std_msgs/Float64']]{'message_definition': 'string data\n', 'callerid': '/rosbridge_websocket', 'tcp_nodelay': '0', 'md5sum': '992ce8a1687cec8c8bd883ec73ca41d1', 'topic': '/ros_cle_simulation/status', 'type': 'std_msgs/String'}
[INFO] [1517921091.068874]: [Client 2] Unsubscribed from /ros_cle_simulation/status
[INFO] [1517921091.078200]: Client disconnected. 0 clients total.

Regards,
Ben


#4

OK, in that case it is as I described above: the simulation server is still running, but it recognized that there was a problem with the previous simulation. The problem is severe in the sense that the previously running simulation could not be completed. This is usually the case when some resource hangs or is trapped in an infinite loop, for example in a Transfer Function.

In that case, the simulation server simply unpublishes the create_new_simulation service so that it will not accept new simulations. The proxy had not queried that information yet, which is why the frontend did not tell you that your locally started simulation server was unavailable.

On the servers, we have a script that periodically scans all simulation servers for situations like this and simply restarts them; locally, there is no such mechanism, so in that case you have to restart your simulation server yourself.
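
Locally, a rough equivalent would be a small shell loop like the one below (a minimal sketch, assuming cle-kill and cle-start are available as commands in that shell and a roscore is running; it is not the actual script we use on the servers):

# Hypothetical local watchdog: restart the backend whenever the
# create_new_simulation service disappears.
while true; do
  if ! rosservice list | grep -q '/ros_cle_simulation/create_new_simulation'; then
    echo "create_new_simulation is gone, restarting the backend"
    cle-kill
    cle-start
  fi
  sleep 60   # poll once per minute
done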


#5

Yeah,
that's what I do right now anyway… I was just curious because the problem came up right after the NRP update I did :wink:

Regards,
Ben


#6

I am afraid that without an error message saying what went wrong, I cannot tell you exactly why the simulation server got stuck. Maybe there is another error message that you overlooked?

For example, we often had this issue in the past because some external component segfaulted. In that case, all you see in the log is a short "Segmentation fault (core dumped)" instead of a long, easy-to-spot stack trace. If it really is a segfault, however, diagnosing the error becomes much harder.
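
If you suspect a segfault, one quick check is the kernel log (a generic Linux suggestion, not NRP-specific):

# Kernel log entries are written for every segfault
dmesg | grep -i segfault

# Optionally enable core dumps for subsequent runs in this shell,
# so a crashing component leaves a core file to inspect
ulimit -c unlimited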