ras-devel / FLAG · Issues · #1 · Closed

Processing pipeline hangs

Issue created 6 years ago by Mitch Burnett @mcb (Owner)
One of the major problems in the processing pipeline is that our hashpipe plugin hangs.

The cause of the hangs is unknown. We are unsure whether it is network limits, memory limits being reached, or whether we are too slow on the I/O transfers on and off the GPU device (i.e., I/O bandwidth). More work needs to be done to identify the stalls.

Currently there is one case of stalling that is reported by the net thread (this may be what happens in all cases, but we are not sure): a print statement saying NET: HANGING HERE!!!. It appears in the net thread while waiting for either the input or output buffer to become free. To get into this state we must be hanging upstream somewhere, and the semaphore that should release the buffers is never triggered.
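
For reference, that wait follows the usual hashpipe thread pattern, roughly as sketched below (a minimal sketch against the stock hashpipe databuf API; the helper name and exact handling are illustrative, not a verbatim excerpt from our plugin):

```c
/* Minimal sketch of the net thread's buffer wait, assuming the stock
 * hashpipe databuf API. The helper name is illustrative. */
#include <stdio.h>
#include "hashpipe.h"

/* Block until an output block is free. hashpipe_databuf_wait_free()
 * waits on the databuf semaphore; if no other thread ever frees the
 * block, the timeout branch below fires on every pass, forever. */
static int wait_for_output_block(hashpipe_databuf_t *db, int block_id)
{
    int rv;
    while ((rv = hashpipe_databuf_wait_free(db, block_id)) != HASHPIPE_OK) {
        if (rv == HASHPIPE_TIMEOUT) {
            /* The state described above: the releasing semaphore
             * never fires, so we spin here. */
            fprintf(stderr, "NET: HANGING HERE!!!\n");
            continue;
        }
        return rv;  /* a real error, not a stall */
    }
    return HASHPIPE_OK;
}
```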

Edited 6 years ago
    Activity


    • Mitch Burnett changed title from Hashpipe hangs to Processing pipeline hangs · 6 years ago

    • Mitch Burnett changed the description · 6 years ago

    • Mitch Burnett mentioned in issue #2 (closed) · 6 years ago

    • Mitch Burnett mentioned in issue #3 · 6 years ago

    • Mitch Burnett @mcb · 6 years ago (Author, Owner)

      CUDA streams (#2 (closed)) did not help here.

      Implementing pinned memory was also included in !1 (merged) and did improve the memory transfer + kernel execution time (down from 35 ms to 7 ms, a 5x improvement), but it did not resolve the hangs.

      At this point CUDA optimizations do not seem to be the way to fix the hangs; something else is causing them. We are now going to try to pinpoint exactly what is going on. With the CUDA code operating faster, it seems that I/O on/off the GPU is not the issue.
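
      For the record, the pinned-memory change amounts to the standard CUDA pattern below (a generic sketch, not the actual !1 diff; the buffer size and names are placeholders):

      ```c
      /* Generic pinned-memory transfer pattern (CUDA runtime API). */
      #include <cuda_runtime.h>

      #define NBYTES (64UL << 20)  /* placeholder size, 64 MiB */

      int pinned_transfer_example(void)
      {
          char *h_buf = NULL, *d_buf = NULL;
          cudaStream_t stream;

          /* Page-locked (pinned) host memory lets the DMA engine copy
           * directly and lets cudaMemcpyAsync actually run asynchronously;
           * pageable memory degrades to staged, synchronous copies. */
          if (cudaHostAlloc((void **)&h_buf, NBYTES, cudaHostAllocDefault)
                  != cudaSuccess)
              return -1;
          cudaMalloc((void **)&d_buf, NBYTES);
          cudaStreamCreate(&stream);

          cudaMemcpyAsync(d_buf, h_buf, NBYTES, cudaMemcpyHostToDevice, stream);
          /* ... launch the beamformer kernel on the same stream ... */
          cudaStreamSynchronize(stream);

          cudaStreamDestroy(stream);
          cudaFree(d_buf);
          cudaFreeHost(h_buf);
          return 0;
      }
      ```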

    • Mitch Burnett @mcb · 6 years ago (Author, Owner)

      We have spent a lot of time looking at CUDA code optimizations to resolve the hangs, and now it is time to isolate other areas of the system that could be causing them.

      1. Monitor /var/log/messages and all relevant OS messages to see whether the OS is telling us something about the hangs.
        • Nothing is currently showing up in /var/log/messages, but the system doesn't appear to be logging much information, so there may be a setting we need to adjust to start getting relevant log messages (i.e., we are looking for something like a virtual memory swap being triggered and logged by the OS, or something similar).
        • The kernel logging level was increased from 3 to 7 (this can be verified in the output of cat /proc/sys/kernel/printk). After doing this we started to see log messages coming through. At the moment there are three different types of errors that show up. Each will get its own issue:
          1. Hashpipe reports a segmentation fault in libbeamformer.so
          2. Several processes report a "page allocation error" with order: 0 and mode:0x20
          3. The bfFitsWriter process reports a segmentation fault in libpthread.so. Careful watching of the log messages while debugging shows that this error comes up when the FITS writer processes are forced to close; it is an artifact of force-quitting the FITS writer and is not relevant to the hangs.
      2. We know (or at least have an idea of) where the hangs are in the thread execution (indicated by the HANGING HERE message). When we hang and this message is not displayed, that is most likely because not all the processing threads print a HANGING HERE message to stdout. From here we could:
      • Add HANGING HERE messages to all while loops, similar to the net thread (see the helper sketch at the end of this comment)
      • Work on logic to gracefully manage hangs (#3)

      See the comment regarding the net_thread below; it turns out that writing better control flow into our packet-processing logic is what we will need to do to resolve this issue.

      3. Look at the FITS writer and other portions of the pipeline:
      • Bypass the FITS writer altogether and use the hashpipe 'null_output_thread'
      • Use the dummy FITS writer for the FITS process
      • Go through the full FITS formatting but write to '/dev/null' instead

      The dummy FITS writer, writing to '/dev/null', and writing plain binary data to lustre all experience hangs of a similar order to what the RTBF mode already experiences. In one experiment we ran the system writing data to lustre for 3 hours, split into 15-minute scans, and in every scan at least one bank hung. Most of the time 2-4 banks hung, and at least one scan saw 6 banks hang. This indicates there are problems further up the pipeline.

      While we did verify that the null_output_thread works, we have not done a long experiment with it like the lustre run described above. However, I expect there would still be hangs.

      4. GBO/FLAG infiniband (IB) network architecture (i.e., are the switch and lustre keeping up?)
        • Trace the IB route to see if we are going through the GBO IB backbone
        • Send data out of the FITS writer through IB and just have the IB switch drop the packets

      We didn't get to testing these cases, since operation without the FITS writer also fails, indicating problems further up the pipeline.
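
      As a concrete form of the instrumentation idea in item 2, a shared helper along these lines would give every thread the same stall visibility the net thread has (a sketch assuming the stock hashpipe databuf API; the helper and thread-name strings are illustrative):

      ```c
      /* Sketch: wait-with-logging helper so every thread reports stalls
       * the way the net thread does. Thread-name strings are illustrative. */
      #include <stdio.h>
      #include "hashpipe.h"

      static int wait_filled_logged(hashpipe_databuf_t *db, int block_id,
                                    const char *thread_name)
      {
          int rv;
          while ((rv = hashpipe_databuf_wait_filled(db, block_id))
                  == HASHPIPE_TIMEOUT) {
              fprintf(stderr, "%s: HANGING HERE (block %d)!!!\n",
                      thread_name, block_id);
          }
          return rv;
      }

      /* e.g. a downstream thread's main loop would call
       *   wait_filled_logged(input_db, curblock, "BF");
       * in place of its bare hashpipe_databuf_wait_filled(). */
      ```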

      Edited 6 years ago by Mitch Burnett
    • Mitch Burnett @mcb · 6 years ago (Author, Owner)

      We have observed that a hang can occur when running all 4 instances of hashpipe on a single machine.

    • Mitch Burnett @mcb · 6 years ago (Author, Owner)

      Going through the DIBAS configuration information we stumbled across some VEGAS development notes that provided system/kernel-level configuration parameters for the network. The notes stated that these were the Mellanox-recommended settings for multi-threaded programs. We applied those changes in the hope that they would help reduce the number of 'Bad Block' errors (which we have been equating to lost/dropped packets), but after applying them and restarting we can't see a difference.

      However, we took the VEGAS developers at their word. We should go back and read through the source for these Mellanox-recommended settings: the Mellanox adapter tuning guidelines.
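
      For reference, the settings in guides like that are typically of this form (illustrative sysctl values only; the exact parameters in the VEGAS notes may differ):

      ```
      # Network tuning of the kind Mellanox adapter guides recommend
      # (illustrative; not necessarily the values from the VEGAS notes).
      # Applied via /etc/sysctl.conf and loaded with sysctl -p.
      net.core.rmem_max = 16777216
      net.core.wmem_max = 16777216
      net.core.rmem_default = 16777216
      net.core.wmem_default = 16777216
      net.core.netdev_max_backlog = 250000
      ```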

    • Mitch Burnett @mcb · 6 years ago (Author, Owner)

      Using the gdb debugger we can stop a process while it is actively running, step through its execution, and examine where it is. When a hang manifested itself we attached gdb to the process and began stepping through the execution, and we noticed that when we "hang" (or what we have been calling a hang) we are stuck in a control-flow loop in the net_thread that passes through the same case on each pass of the state machine.

      As mentioned, the problem is in the net_thread. Notice that nothing is done there to reinitialize the packet-handling process. The return value is -1, which equates to a logic error in the control-flow statement following the process_packet call. After some ancillary checks on whether the state has changed, we drop back to the top of the while (run_threads()) { loop, and since we are still in the ACQUIRE state we again receive a packet and again reach process_packet. Since nothing has changed since the last pass in terms of reinitializing the mcnt, we begin the same loop again.
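
      Reading it that way, the stuck path is roughly the following (a paraphrase of the control flow described above, not a verbatim excerpt; process_packet's real signature differs):

      ```c
      /* Paraphrase of the stuck ACQUIRE-state loop in the net thread. */
      extern int run_threads(void);   /* hashpipe */
      extern int process_packet();    /* plugin; real signature differs */

      void acquire_loop_paraphrase(void)
      {
          int mcnt;
          while (run_threads()) {
              /* Still in the ACQUIRE state: receive the next packet ... */
              mcnt = process_packet();
              if (mcnt == -1) {
                  /* Logic-error path: nothing reinitializes the packet
                   * handling state (in particular the expected mcnt), so
                   * after the ancillary state-change checks we fall back
                   * to the top of the loop, receive another packet, and
                   * take this same branch again. */
                  continue;
              }
              /* Normal path: mark blocks filled/free and advance mcnt. */
          }
      }
      ```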

      The next step, which I think will fully prove this, is to simulate mcnts at a slow rate in a way that forces that control branch to be taken, to see if we can enter the control loop. One idea would be to use the MATLAB packet generator and provide a way to throw specified mcnts at it.

      Edited 6 years ago by Mitch Burnett