ras-devel / FLAG · Issues · #1 · Closed

Processing pipeline hangs

Issue created 6 years ago by Mitch Burnett @mcb (Owner)
One of the major problems in the processing pipeline is that our hashpipe plugin hangs.

The cause of the hangs is unknown. We are unsure whether it is network limits, memory limits being reached, or whether we are too slow on the I/O transfers on and off the GPU device (i.e., I/O bandwidth). More work needs to be done to identify the stalls.

Currently there is one case of stalling that is reported by the net thread (this may be what happens in all cases, but we are not sure): a print statement saying NET: HANGING HERE!!!. It appears in the net thread while waiting for either the input or output buffer to become free. To get into this state we must be hanging upstream somewhere, and the semaphore that should release the buffers is never triggered.
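
For reference, that wait follows the usual hashpipe thread pattern, roughly as sketched below (a minimal sketch against the stock hashpipe databuf API; the helper name and exact handling are illustrative, not a verbatim excerpt from our plugin):

```c
/* Minimal sketch of the net thread's buffer wait, assuming the stock
 * hashpipe databuf API. The helper name is illustrative. */
#include <stdio.h>
#include "hashpipe.h"

/* Block until an output block is free. hashpipe_databuf_wait_free()
 * waits on the databuf semaphore; if no other thread ever frees the
 * block, the timeout branch below fires on every pass, forever. */
static int wait_for_output_block(hashpipe_databuf_t *db, int block_id)
{
    int rv;
    while ((rv = hashpipe_databuf_wait_free(db, block_id)) != HASHPIPE_OK) {
        if (rv == HASHPIPE_TIMEOUT) {
            /* The state described above: the releasing semaphore
             * never fires, so we spin here. */
            fprintf(stderr, "NET: HANGING HERE!!!\n");
            continue;
        }
        return rv;  /* a real error, not a stall */
    }
    return HASHPIPE_OK;
}
```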

Edited 6 years ago
    Activity


    • Mitch Burnett changed title from Hashpipe hangs to Processing pipeline hangs · 6 years ago

    • Mitch Burnett changed the description · 6 years ago

    • Mitch Burnett mentioned in issue #2 (closed) · 6 years ago

    • Mitch Burnett mentioned in issue #3 · 6 years ago

    • Mitch Burnett @mcb · 6 years ago (Author, Owner)

      CUDA streams (#2 (closed)) did not help here.

      Implementing pinned memory was also included in !1 (merged) and did improve the memory transfer + kernel execution time (down from 35 ms to 7 ms, a 5x improvement), but it did not resolve the hangs.

      At this point CUDA optimizations do not seem to be the way to fix the hangs; something else is causing them. We are now going to try to pinpoint exactly what is going on. With the CUDA code operating faster, it seems that I/O on/off the GPU is not the issue.
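
      For the record, the pinned-memory change amounts to the standard CUDA pattern below (a generic sketch, not the actual !1 diff; the buffer size and names are placeholders):

      ```c
      /* Generic pinned-memory transfer pattern (CUDA runtime API). */
      #include <cuda_runtime.h>

      #define NBYTES (64UL << 20)  /* placeholder size, 64 MiB */

      int pinned_transfer_example(void)
      {
          char *h_buf = NULL, *d_buf = NULL;
          cudaStream_t stream;

          /* Page-locked (pinned) host memory lets the DMA engine copy
           * directly and lets cudaMemcpyAsync actually run asynchronously;
           * pageable memory degrades to staged, synchronous copies. */
          if (cudaHostAlloc((void **)&h_buf, NBYTES, cudaHostAllocDefault)
                  != cudaSuccess)
              return -1;
          cudaMalloc((void **)&d_buf, NBYTES);
          cudaStreamCreate(&stream);

          cudaMemcpyAsync(d_buf, h_buf, NBYTES, cudaMemcpyHostToDevice, stream);
          /* ... launch the beamformer kernel on the same stream ... */
          cudaStreamSynchronize(stream);

          cudaStreamDestroy(stream);
          cudaFree(d_buf);
          cudaFreeHost(h_buf);
          return 0;
      }
      ```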

    • Mitch Burnett @mcb · 6 years ago (Author, Owner)

      We have spent a lot of time looking at CUDA code optimizations to resolve the hangs, and now it is time to isolate other areas of the system that could be causing them.

      1. Monitor /var/log/messages and all relevant OS messages to see whether the OS is telling us something about the hangs.
        • Nothing is currently showing up in /var/log/messages, but the system doesn't appear to be logging much information, so there may be a setting we need to adjust to start getting relevant log messages (i.e., we are looking for something like a virtual memory swap being triggered and logged by the OS, or something similar).
        • The kernel logging level was increased from 3 to 7 (this can be verified in the output of cat /proc/sys/kernel/printk). After doing this we started to see log messages coming through. At the moment there are three different types of errors that show up. Each will get its own issue:
          1. Hashpipe reports a segmentation fault in libbeamformer.so
          2. Several processes report a "page allocation error" with order: 0 and mode:0x20
          3. The bfFitsWriter process reports a segmentation fault in libpthread.so. Careful watching of the log messages while debugging shows that this error comes up when the FITS writer processes are forced to close; it is an artifact of force-quitting the FITS writer and is not relevant to the hangs.
      2. We know (or at least have an idea of) where the hangs are in the thread execution (indicated by the HANGING HERE message). When we hang and this message is not displayed, that is most likely because not all the processing threads print a HANGING HERE message to stdout. From here we could:
      • Add HANGING HERE messages to all while loops, similar to the net thread (see the helper sketch at the end of this comment)
      • Work on logic to gracefully manage hangs (#3)

      See the comment regarding the net_thread below; it turns out that writing better control flow into our packet-processing logic is what we will need to do to resolve this issue.

      3. Look at the FITS writer and other portions of the pipeline:
      • Bypass the FITS writer altogether and use the hashpipe 'null_output_thread'
      • Use the dummy FITS writer for the FITS process
      • Go through the full FITS formatting but write to '/dev/null' instead

      The dummy FITS writer, writing to '/dev/null', and writing plain binary data to lustre all experience hangs of a similar order to what the RTBF mode already experiences. In one experiment we ran the system writing data to lustre for 3 hours, split into 15-minute scans, and in every scan at least one bank hung. Most of the time 2-4 banks hung, and at least one scan saw 6 banks hang. This indicates there are problems further up the pipeline.

      While we did verify that the null_output_thread works, we have not done a long experiment with it like the lustre run described above. However, I expect there would still be hangs.

      4. GBO/FLAG infiniband (IB) network architecture (i.e., are the switch and lustre keeping up?)
        • Trace the IB route to see if we are going through the GBO IB backbone
        • Send data out of the FITS writer through IB and just have the IB switch drop the packets

      We didn't get to testing these cases, since operation without the FITS writer also fails, indicating problems further up the pipeline.
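
      As a concrete form of the instrumentation idea in item 2, a shared helper along these lines would give every thread the same stall visibility the net thread has (a sketch assuming the stock hashpipe databuf API; the helper and thread-name strings are illustrative):

      ```c
      /* Sketch: wait-with-logging helper so every thread reports stalls
       * the way the net thread does. Thread-name strings are illustrative. */
      #include <stdio.h>
      #include "hashpipe.h"

      static int wait_filled_logged(hashpipe_databuf_t *db, int block_id,
                                    const char *thread_name)
      {
          int rv;
          while ((rv = hashpipe_databuf_wait_filled(db, block_id))
                  == HASHPIPE_TIMEOUT) {
              fprintf(stderr, "%s: HANGING HERE (block %d)!!!\n",
                      thread_name, block_id);
          }
          return rv;
      }

      /* e.g. a downstream thread's main loop would call
       *   wait_filled_logged(input_db, curblock, "BF");
       * in place of its bare hashpipe_databuf_wait_filled(). */
      ```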

      Edited 6 years ago by Mitch Burnett
    • Mitch Burnett @mcb · 6 years ago (Author, Owner)

      We have observed that a hang can occur when running all 4 instances of hashpipe on a single machine.

    • Mitch Burnett @mcb · 6 years ago (Author, Owner)

      Going through the DIBAS configuration information we stumbled across some VEGAS development notes that provided system/kernel-level configuration parameters for the network. The notes stated that these were the Mellanox-recommended settings for multi-threaded programs. We applied those changes in the hope that they would help reduce the number of 'Bad Block' errors (which we have been equating to lost/dropped packets), but after applying them and restarting we can't see a difference.

      However, we took the VEGAS developers at their word. We should go back and read through the source for these Mellanox-recommended settings: the Mellanox adapter tuning guidelines.
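
      For reference, the settings in guides like that are typically of this form (illustrative sysctl values only; the exact parameters in the VEGAS notes may differ):

      ```
      # Network tuning of the kind Mellanox adapter guides recommend
      # (illustrative; not necessarily the values from the VEGAS notes).
      # Applied via /etc/sysctl.conf and loaded with sysctl -p.
      net.core.rmem_max = 16777216
      net.core.wmem_max = 16777216
      net.core.rmem_default = 16777216
      net.core.wmem_default = 16777216
      net.core.netdev_max_backlog = 250000
      ```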

    • Mitch Burnett @mcb · 6 years ago (Author, Owner)

      Using the gdb debugger we can stop a process while it is actively running, step through its execution, and examine where it is. When a hang manifested itself we attached gdb to the process and began stepping through the execution, and we noticed that when we "hang" (or what we have been calling a hang) we are stuck in a control-flow loop in the net_thread that passes through the same case on each pass of the state machine.

      As mentioned, the problem is in the net_thread. Notice that nothing is done there to reinitialize the packet-handling process. The return value is -1, which equates to a logic error in the control-flow statement following the process_packet call. After some ancillary checks on whether the state has changed, we drop back to the top of the while (run_threads()) { loop, and since we are still in the ACQUIRE state we again receive a packet and again reach process_packet. Since nothing has changed since the last pass in terms of reinitializing the mcnt, we begin the same loop again.
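
      Reading it that way, the stuck path is roughly the following (a paraphrase of the control flow described above, not a verbatim excerpt; process_packet's real signature differs):

      ```c
      /* Paraphrase of the stuck ACQUIRE-state loop in the net thread. */
      extern int run_threads(void);   /* hashpipe */
      extern int process_packet();    /* plugin; real signature differs */

      void acquire_loop_paraphrase(void)
      {
          int mcnt;
          while (run_threads()) {
              /* Still in the ACQUIRE state: receive the next packet ... */
              mcnt = process_packet();
              if (mcnt == -1) {
                  /* Logic-error path: nothing reinitializes the packet
                   * handling state (in particular the expected mcnt), so
                   * after the ancillary state-change checks we fall back
                   * to the top of the loop, receive another packet, and
                   * take this same branch again. */
                  continue;
              }
              /* Normal path: mark blocks filled/free and advance mcnt. */
          }
      }
      ```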

      The next step, which I think will fully prove this, is to simulate mcnts at a slow rate in a way that forces that control branch to be taken, to see if we can enter the control loop. One idea would be to use the MATLAB packet generator and provide a way to throw specified mcnts at it.

      Edited 6 years ago by Mitch Burnett