Too many neurons for NRP?


#1

Hi,

For my master’s thesis, I am currently trying to recreate the experiment of @alexander_ugent published in this paper. In the NRP templates I found the “Tigrillo SNN Learning in Closed Loop” experiment, which is the closest implementation I could get my hands on. However, this template is not as sophisticated as the paper’s experiment. For example, its SNN is much smaller (10 populations of 100 neurons each, i.e. 1,000 neurons).

Unfortunately, when I increase the SNN size to its target value (300 columns of 40 neurons each, i.e. 12,000 neurons), the NRP throws an error or gets stuck.
When starting the experiment via the official NRP web frontend, the simulation gets stuck at “Initializing CLE”.
When trying to launch the same experiment on the online NRP servers via the Virtual Coach instead, I simply get an HTTP 500 error (but no stuck simulation):

INFO: [2022-09-20 21:54:15,518 - VirtualCoach] Preparing to launch tigrillo-cl-snn-learning_13.
INFO: [2022-09-20 21:54:15,520 - VirtualCoach] Retrieving list of experiments.
INFO: [2022-09-20 21:54:19,056 - VirtualCoach] Retrieving list of available servers.
INFO: [2022-09-20 21:54:19,130 - Simulation] Attempting to launch tigrillo-cl-snn-learning_13 on prod_latest_backend-11-80.
ERROR: [2022-09-20 21:54:31,512 - Simulation] Unable to launch on prod_latest_backend-11-80: Simulation responded with HTTP status 500
Traceback (most recent call last):
  File "/home/bbpnrsoa/nrp/src/VirtualCoach/hbp_nrp_virtual_coach/pynrp/simulation.py", line 158, in launch
    raise Exception(
Exception: Simulation responded with HTTP status 500

When using my local Docker NRP, I observe the same behaviour for the VC launch. For the manual frontend launch, however, I receive an error message (at step “Initializing CLE”):

ERROR TYPE: UnknownError
ERROR CODE: -1
MESSAGE: An error occured. Please try again later. (host: 172.19.0.4)
STACK TRACE:
n@http://localhost:9000/scripts/app.bd5c68e5.js:1:16340
error@http://localhost:9000/scripts/app.bd5c68e5.js:1:16554
controller@http://localhost:9000/scripts/app.bd5c68e5.js:1:18466
invoke@http://localhost:9000/node_modules/angular/angular.js:4771:19
$controllerInit@http://localhost:9000/node_modules/angular/angular.js:10592:34
resolveSuccess@http://localhost:9000/node_modules/angular-ui-bootstrap/dist/ui-bootstrap-tpls.js:4154:34
processQueue@http://localhost:9000/node_modules/angular/angular.js:16696:28
qFactory/scheduleProcessQueue/<@http://localhost:9000/node_modules/angular/angular.js:16712:39
$eval@http://localhost:9000/node_modules/angular/angular.js:17994:28
$digest@http://localhost:9000/node_modules/angular/angular.js:17808:31
$apply@http://localhost:9000/node_modules/angular/angular.js:18102:24
done@http://localhost:9000/node_modules/angular/angular.js:12082:47
completeRequest@http://localhost:9000/node_modules/angular/angular.js:12291:15
requestError@http://localhost:9000/node_modules/angular/angular.js:12229:24

To reproduce the issue, you need to change the following lines in “CPG_brain.py” of the template:

  • In line 59: 5*2 to 300 (number of populations)
  • In line 93: 80 to 30 (excitatory neurons per population)
  • In line 94: 20 to 10 (inhibitory neurons per population)
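
As a quick sanity check of the sizes involved, the totals follow directly from these three values. This is only a sketch of the arithmetic; the helper name total_neurons is made up for illustration and is not part of CPG_brain.py.

```python
def total_neurons(n_populations, n_excitatory, n_inhibitory):
    """Total neuron count for equally sized populations (illustrative helper)."""
    return n_populations * (n_excitatory + n_inhibitory)


# Values as shipped in the template: 5*2 populations, 80 + 20 neurons each.
template_size = total_neurons(5 * 2, 80, 20)

# Values after the three changes listed above: 300 populations, 30 + 10 neurons each.
target_size = total_neurons(300, 30, 10)

print(f"template: {template_size} neurons, target: {target_size} neurons")
```

So the change multiplies the number of populations by 30 while shrinking each population from 100 to 40 neurons.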

I do not think that this is a code- or experiment-specific issue, because moderately larger networks generally work. The simulation just takes longer to start the more neurons are requested (e.g. for 40x40 = 1,600 neurons the startup takes some time but still works in the end). However, beyond some size the simulation startup seems to freeze completely. Therefore, I think this is a scalability issue within the NRP.

Do you have any idea how I can circumvent this? Is it normal for the NRP that a NEST network with 2,000+ neurons is too much? Do you know how Alexander managed to run this simulation? (I also asked him directly via email in parallel but have not received a response yet.) What else can I try?

Thanks for your help!

Best regards,
Felix

P.S.: I cannot stop the simulations that got stuck at startup on the server, similar to the issue I mentioned in my last thread. It would be great if someone could again stop these simulations for me (and maybe fix the root cause…)


#2

Hi ge69yed,
Just to let you know, I’ve brought this up with the team, and I will also raise it at our weekly standup to see whether anyone has ideas regarding the neuron count and whether we can find a way to prevent the issue with stopping the sim.


#3

FYI, the issue is still being investigated by the dev team. The related ticket is
https://hbpneurorobotics.atlassian.net/browse/NUIT-300


#4

Update from my side:

I removed all PyNN dependencies and rewrote the whole network in pure NEST. As PyNN is incredibly inefficient when it comes to building larger networks, this reduced the creation time for the network to < 60 seconds even for 12,000 neurons (300 populations). So blocking at network creation is no longer an issue.

Unfortunately, an HTTP 504 error now occurs roughly 1 min after launch.
I tested on 3 different setups:

  • On the official NRP cluster this happens for > ~3,000 neurons
  • On my PC (Docker, 16 cores, 32 GB RAM, 6 GB VRAM) for > ~10,000 neurons
  • On a cluster @BenediktFeldotto provided (72 cores, I don’t know how much RAM) for > ~18,000 neurons

A complete working simulation (at least in my Docker setup) uses ~5 GB of RAM, so I assume it’s not a memory issue. VRAM usage and GPU utilisation are also pretty low, so I would rule these out too.

Within my Docker container, I took a look at /var/log/supervisor/ros-simulation-factory_app/ros-simulation-factory_app.out as well as ros-simulation-factory_app.err, neither of which seems to show any errors. My script just stops (crashes?) at some point without any error. Maybe I am overlooking something. If you want, I can provide the .out for a small (working) network and a big (crashing) network.
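
One way to double-check that nothing was overlooked is to scan both files for error-like keywords and look at the last lines written before the script stops. The following is only a minimal sketch: the keyword list and the helper name summarize_log are assumptions made for illustration, and only the log paths come from the description above.

```python
import re
from pathlib import Path

# Hypothetical keyword list; adjust it to whatever the backend actually logs.
SUSPICIOUS = re.compile(r"error|exception|traceback|killed|timed? ?out", re.IGNORECASE)


def summarize_log(path, tail=10):
    """Return line count, suspicious lines (1-based numbers) and the last `tail` lines."""
    lines = Path(path).read_text(errors="replace").splitlines()
    hits = [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if SUSPICIOUS.search(line)
    ]
    last_lines = lines[-tail:] if tail > 0 else []
    return {"n_lines": len(lines), "hits": hits, "tail": last_lines}
```

Calling summarize_log("/var/log/supervisor/ros-simulation-factory_app/ros-simulation-factory_app.out") for a small (working) run and again for a big (crashing) run, then comparing the two "tail" entries, should show the last step the big run reached before it stopped.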


#5

I see you updated the related ticket (NUIT-300) as well. Perfect. We will investigate further.


#6

Hello ge69yed,

Could you send us the content of the folder /var/log/supervisor/ and the experiment you are running as a zip? We’ll try to reproduce the issue. Please attach it to the Jira issue 300 (see above).
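
In case it helps, the log folder can be packed from inside the container with a few lines of Python. This is just a sketch under assumptions: the function name archive_logs and the output name supervisor_logs are made up for illustration, and reading the folder may require sufficient permissions.

```python
import shutil
from pathlib import Path


def archive_logs(log_dir="/var/log/supervisor", out_base="supervisor_logs"):
    """Zip the contents of `log_dir` into `<out_base>.zip` and return the archive path."""
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        raise FileNotFoundError(f"log directory not found: {log_dir}")
    # make_archive appends the ".zip" suffix itself and returns the final file name.
    return shutil.make_archive(str(out_base), "zip", root_dir=str(log_dir))


# Example call (run inside the container):
# print(archive_logs())
```

The archive should be written outside the log folder (for example to the home directory) so that it does not end up inside itself.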


#7

Done.

As I work with a VC script that modifies the experiment dynamically before launch, I uploaded a simplified hard-coded version. I do not think that makes a difference for the underlying issue we are trying to investigate. However, if necessary, I can make a simplified version of my VC code and upload that too.