Simulation on SpiNNaker gets stuck


#1

Hello everyone,

I’m working with the SpiNNaker boards on your platform (HBP), and to gain some experience I am using a simple task: I build a small network and present randomly chosen bars to it (each bar for 100 ms, then a new bar, and so on). I use the existing Pfister and Gerstner Spiking Triplet rule to learn the weights between the input layer and a second layer.
My problem is: if I present 300 bars one after another, the simulation runs without a problem, but if I want to present 500 bars, the simulation seems to get stuck. There are two processes (ID 130183 and ID 130213) which I started on the 15th and 17th of June, but the counter of used SpiNNaker core-hours has shown the same value for 2 days. So I assume the processes are not running anymore.
I had already cancelled another process before because it got stuck too, and restarted it to see if it was just bad luck.

I hope someone has a solution to this problem; if necessary, I can post the source code here.

Best regards and thank you very much in advance,
René Larisch


#2

Hi,

Sorry, we weren’t tracking this forum, so this only got through to us by other channels! I have now registered to receive notifications from here for Neuromorphic questions.

Regarding your jobs, they did seem to get stuck. I think this was due to interference from another job that was printing a lot of output, which causes an upload to the server. I have cancelled all the jobs, so these will show as errors. If you want to resubmit them, they should hopefully go through correctly this time, but let us know if they get stuck again.

Thanks,

Andrew :)


#3

Hello Andrew,

Thank you for your response.
I will start the process again and come back here if there is still a problem.

Best regards,
René Larisch


#4

Hello Andrew,

I hope you see this; since it is the same problem, I did not want to open a new topic.
My new process (ID 130257) got stuck again.
Could it be a problem that in my process a new input is set after every 100 ms of simulation time?
If only 300 input stimuli are shown, the simulation runs perfectly (see ID 130256), but with 500 it gets stuck.
At the moment I am not very familiar with the internal processing of the SpiNNaker chips.
Or could it be another problem?

Thanks for your help!
Best regards,
René


#5

Hi René,

We think this problem is related to a known bug that we are currently working on. I have been able to reproduce your error myself (though interestingly, it seems more prevalent when using the collab than when running on a local board or directly from the server next to the big machine). I have also noticed that I occasionally get an actual error, the trace of which ends with “SpinnmanIOException: IO Error: [Errno 22] Invalid argument”. Have you seen this at all?

There are two workarounds for now. One is simply to run fewer input stimuli, as you have done already. The other is to add a short sleep after each run: in the loop in your script, after your call to sim.run(sim_duration), add time.sleep(0.1). I’ve found this helps when trying to run things this morning, at least. You may still find that your jobs occasionally hang, but hopefully less often…
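
The sleep workaround can be sketched roughly as below. This is only a minimal sketch, assuming the script presents one stimulus per loop iteration. `present_stimuli`, `set_input` and the `RecordingSim` stand-in are hypothetical names used for illustration; in a real script, `sim` would be the simulator module the script already imports, and the 0.1 s pause is simply the value suggested above.

```python
import time


class RecordingSim:
    """Hypothetical stand-in for the simulator module, so the sketch runs
    without a board. It only records the durations passed to run()."""

    def __init__(self):
        self.runs = []

    def run(self, duration_ms):
        self.runs.append(duration_ms)


def present_stimuli(sim, n_stimuli, sim_duration, pause_s=0.1, set_input=None):
    """Run one stimulus per iteration, sleeping briefly after each run."""
    for index in range(n_stimuli):
        if set_input is not None:
            set_input(index)      # hypothetical hook: choose the next bar
        sim.run(sim_duration)     # simulate this stimulus (e.g. 100 ms per bar)
        time.sleep(pause_s)       # short pause after each run (the workaround)


if __name__ == "__main__":
    fake = RecordingSim()
    present_stimuli(fake, n_stimuli=5, sim_duration=100, pause_s=0.01)
    print(len(fake.runs), "runs recorded")
```

Keeping the pause as a parameter makes it easy to try other values if 0.1 s turns out not to be enough.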

Thanks,

Andy


#6

Hello Andy,

I looked at some of the older failed runs and yes, the error does appear sometimes. What does it mean?

Anyway, thank you for your help. I will try using a short sleep and see whether the simulation finishes now.

Best regards,
René


#7

Hi René,

I am not completely sure exactly what that error means, but my feeling is that it is related to the problem you are having with hanging jobs, just manifesting itself in a slightly different way.

One thing I noticed with your script, which should not affect a single set of n_stimuli runs by itself but may have an effect further down the line, is that there is no call to sim.end() in it. Ideally the big machine resets itself, but it may be that boards are being left in an unfinished state that is causing problems elsewhere. You should call sim.end() at the end of every script to make sure the boards you have been using get freed again correctly.
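
One way to make sure sim.end() is always reached, even if a run raises an error part-way through, is to wrap the body of the script in try/finally. This is a minimal sketch, assuming `sim` exposes run() and end() as in the script discussed here; `run_and_release`, `three_stimuli` and the `RecordingSim` stand-in are hypothetical names for illustration only.

```python
class RecordingSim:
    """Hypothetical stand-in for the simulator module; records calls only."""

    def __init__(self):
        self.calls = []

    def run(self, duration_ms):
        self.calls.append(("run", duration_ms))

    def end(self):
        self.calls.append(("end",))


def run_and_release(sim, body):
    """Execute body(sim) and always call sim.end() afterwards."""
    try:
        return body(sim)
    finally:
        sim.end()   # release the simulator even if body() raised


def three_stimuli(sim):
    """Hypothetical script body: three runs of 100 ms each."""
    for _ in range(3):
        sim.run(100)


if __name__ == "__main__":
    fake = RecordingSim()
    run_and_release(fake, three_stimuli)
    print(fake.calls[-1])
```

The finally clause runs on normal completion as well as on an exception, so end() is the last call in both cases.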

Andy