Vanja Z | 21 Apr 01:36 2013

[OMPI users] QLogic HCA random crash after prolonged use

Hi all,

I have been trying to set up a cluster with QLogic QLE7140 HCAs and a Cisco SFS-7000 24-port switch. The
machines are running Debian Wheezy.

I have installed OpenMPI from the repos (1.4.5), along with these packages:
libibverbs1
libipathverbs1
libmthca1
librdmacm1
I have also tested OpenMPI compiled from the latest sources (1.6.4) with the same results. I modprobe
rdma_ucm, ib_umad and ib_uverbs in order to get MPI jobs to run.
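
A quick way to confirm that the three modules named above are actually loaded on a node is to look at /proc/modules. The following is only a minimal sketch, assuming a Linux system where the module name is the first field of each line in /proc/modules; the helper name missing_modules is made up for illustration and is not part of any InfiniBand tooling.

```python
# Sketch: report which of the kernel modules named in the message are not loaded.
# Assumption: Linux /proc/modules format, where the module name is the first field.
REQUIRED_MODULES = ("rdma_ucm", "ib_umad", "ib_uverbs")


def missing_modules(proc_modules_text, required=REQUIRED_MODULES):
    """Return the required modules that do not appear in the given /proc/modules text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in required if name not in loaded]


if __name__ == "__main__":
    try:
        with open("/proc/modules") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    for name in missing_modules(text):
        print("not loaded:", name)
```

Running it on each node before starting a job prints one line per module that still needs a modprobe, and nothing if all three are present.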

I'm not sure whether what I've done is enough to configure the network correctly, but I have tested
several MPI-capable codes that use Fortran and C, specifying the openib interface with the flag '--mca btl
openib,self'. Things work initially and the bandwidth is as expected; however, after anywhere from 4 to 30
hours the jobs crash. The longest job to complete successfully ran for around 48 hours, but jobs
rarely make it past 4 hours. The error message is always this:

[[36446,1],2][../../../../../../ompi/mca/btl/openib/btl_openib_component.c:3238:handle_wc]
from host84 to: host85 error polling LP CQ with status RETRY EXCEEDED ERROR
status number 12 for wr_id 36085024 opcode 2  vendor error 0 qp_idx 3
--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event.  Open MPI
will try to continue, but your job may end up failing.

  Local host:        host85
  MPI process PID:   7912
  Error number:      10 (IBV_EVENT_PORT_ERR)
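
When many jobs fail this way, it can help to pull the host pair and completion status out of the logs to see whether the same link is always involved. The following is a rough sketch that assumes the error text has exactly the shape quoted above (possibly wrapped across lines); the function name parse_wc_error and the field names are invented for illustration and are not part of Open MPI.

```python
import re

# Sketch: extract source host, destination host, CQ type and status from the
# openib BTL error text quoted above. The pattern is an assumption based only
# on that one sample and tolerates line wrapping via \s+.
WC_ERROR = re.compile(
    r"from\s+(?P<src>\S+)\s+to:\s+(?P<dst>\S+)\s+error polling\s+(?P<cq>\w+)\s+CQ\s+"
    r"with status\s+(?P<status>.+?)\s+status number\s+(?P<number>\d+)",
    re.DOTALL,
)


def parse_wc_error(log_text):
    """Return a dict of fields from the first matching error, or None if absent."""
    match = WC_ERROR.search(log_text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["number"] = int(fields["number"])
    return fields
```

Feeding each job's stderr through parse_wc_error and counting the (src, dst) pairs would show whether the failures cluster on one host, cable or switch port.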


Elken, Tom | 21 Apr 07:23 2013

Re: [OMPI users] QLogic HCA random crash after prolonged use

> I have seen it recommended to use psm instead of openib for QLogic cards.
[Tom] 
Yes.  PSM will perform better and be more stable when running OpenMPI than using verbs.  Intel acquired
the InfiniBand assets of QLogic about a year ago.  These SDR HCAs are no longer supported, but should still
work.  You can get the driver (ib_qib) and PSM library from OFED 1.5.4.1 or the current release, OFED 3.5.
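
For anyone comparing the two transports, the difference on the command line can be sketched as below. This is an illustration only: the openib flag is the one quoted in the original post, while the PSM flags assume an Open MPI build that includes the psm MTL, selected through the cm PML. The helper name mpirun_command and the program name ./my_app are hypothetical.

```python
def mpirun_command(transport, nprocs, program):
    """Build an mpirun argument list for the chosen transport (sketch only)."""
    if transport == "openib":
        # Flag quoted in the original post: verbs via the openib BTL.
        mca = ["--mca", "btl", "openib,self"]
    elif transport == "psm":
        # Assumption: Open MPI was built with PSM support, so the psm MTL
        # can be selected through the cm PML.
        mca = ["--mca", "pml", "cm", "--mca", "mtl", "psm"]
    else:
        raise ValueError("unknown transport: %r" % (transport,))
    return ["mpirun", "-np", str(nprocs)] + mca + [program]


if __name__ == "__main__":
    # ./my_app is a placeholder for the real MPI executable.
    print(" ".join(mpirun_command("psm", 16, "./my_app")))
```

Returning a list rather than a string keeps the arguments ready for subprocess.run without shell quoting issues.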

The current OFED 3.5 release includes PSM release notes, which start out this way (read down
to the OpenMPI build instructions for PSM):
"
              Open Fabrics Enterprise Distribution (OFED)
                    PSM in OFED 3.5 Release Notes

                          December 2012

======================================================================
1. Overview
======================================================================

The Performance Scaled Messaging (PSM) API is Intel's low-level user-level
communications interface for the Intel True Scale Fabric family of products.

The PSM libraries are included in the infinipath-psm-3.1-364.1140_open.src.rpm
source RPM and get built and installed as part of a default OFED
install process.

The primary way to use PSM is by compiling applications with an MPI that has
been built to use the PSM layer as its interface to QLogic HCAs.
PSM is the high-performance interface to the QLogic TrueScale HCAs.

Minimal instructions* for building two MPIs tested with OFED 

Dave Love | 24 Apr 17:58 2013

Re: [OMPI users] QLogic HCA random crash after prolonged use

"Elken, Tom" <tom.elken <at> intel.com> writes:

>> I have seen it recommended to use psm instead of openib for QLogic cards.
> [Tom] 
> Yes.  PSM will perform better and be more stable when running OpenMPI
> than using verbs.

But unfortunately you won't be able to checkpoint.

> Intel has acquired the InfiniBand assets of QLogic
> about a year ago.  These SDR HCAs are no longer supported, but should
> still work.

Do you mean they should work with the latest infinipath libraries
(despite what it said or implied in the notes for last version I got
from QLogic?) or possibly what's in RHEL?  I thought I'd actually tried
and failed with later stuff, but may just have gone by the release notes.

> You can get the driver (ib_qib) and PSM library from OFED 1.5.4.1 or
> the current release OFED 3.5.

I wonder if there's a version of the driver that's known to work in a
current RHEL5 system with QLE7140.  We get frequent qib-related kernel
panics from a vanilla RHEL5.9 kernel -- after running OK under test for
a few weeks, and nothing relevant appearing to have changed to cause
it...  (There's a trace on the redhat bugzilla with qib in the issue
title, for what it's worth.)  I'm currently reverting to old stuff.

It's good if Infinipath-land is taking an interest in OMPI again, and
that the libraries are now under a free licence.

Ralph Castain | 24 Apr 18:32 2013

Re: [OMPI users] QLogic HCA random crash after prolonged use


On Apr 24, 2013, at 8:58 AM, Dave Love <d.love <at> liverpool.ac.uk> wrote:

> "Elken, Tom" <tom.elken <at> intel.com> writes:
> 
>>> I have seen it recommended to use psm instead of openib for QLogic cards.
>> [Tom] 
>> Yes.  PSM will perform better and be more stable when running OpenMPI
>> than using verbs.
> 
> But unfortunately you won't be able to checkpoint.

True, but remember that OMPI no longer supports checkpoint/restart after the 1.6 series, pending a new
supporter coming along.

> 
>> Intel has acquired the InfiniBand assets of QLogic
>> about a year ago.  These SDR HCAs are no longer supported, but should
>> still work.
> 
> Do you mean they should work with the latest infinipath libraries
> (despite what it said or implied in the notes for last version I got
> from QLogic?) or possibly what's in RHEL?  I thought I'd actually tried
> and failed with later stuff, but may just have gone by the release notes.
> 
>> You can get the driver (ib_qib) and PSM library from OFED 1.5.4.1 or
>> the current release OFED 3.5.
> 
> I wonder if there's a version of the driver that's known to work in a
> current RHEL5 system with QLE7140.  We get frequent qib-related kernel

Dave Love | 25 Apr 18:11 2013

Re: [OMPI users] QLogic HCA random crash after prolonged use

Ralph Castain <rhc <at> open-mpi.org> writes:

> On Apr 24, 2013, at 8:58 AM, Dave Love <d.love <at> liverpool.ac.uk> wrote:
>
>> "Elken, Tom" <tom.elken <at> intel.com> writes:
>> 
>>>> I have seen it recommended to use psm instead of openib for QLogic cards.
>>> [Tom] 
>>> Yes.  PSM will perform better and be more stable when running OpenMPI
>>> than using verbs.
>> 
>> But unfortunately you won't be able to checkpoint.
>
> True - yet remember that OMPI no longer supports checkpoint/restart
> after the 1.6 series. Pending a new supporter coming along

As far as I can tell, lack of PSM checkpointing isn't specific to OMPI,
and I know people have resorted to verbs to get it.

Dropped CR is definitely a reason not to use OMPI past 1.6.  [By the way,
the release notes are confusing, saying that DMTCP is supported, but CR
is dropped.]  I'd have hoped a vendor who needs to support CR would
contribute, but I suppose changes just become proprietary if they move
the base past 1.6 :-(.

For general information, what makes the CR support difficult to maintain
-- is it just a question of effort?

Ralph Castain | 25 Apr 18:19 2013

Re: [OMPI users] QLogic HCA random crash after prolonged use


On Apr 25, 2013, at 9:11 AM, Dave Love <d.love <at> liverpool.ac.uk> wrote:

> Ralph Castain <rhc <at> open-mpi.org> writes:
> 
>> On Apr 24, 2013, at 8:58 AM, Dave Love <d.love <at> liverpool.ac.uk> wrote:
>> 
>>> "Elken, Tom" <tom.elken <at> intel.com> writes:
>>> 
>>>>> I have seen it recommended to use psm instead of openib for QLogic cards.
>>>> [Tom] 
>>>> Yes.  PSM will perform better and be more stable when running OpenMPI
>>>> than using verbs.
>>> 
>>> But unfortunately you won't be able to checkpoint.
>> 
>> True - yet remember that OMPI no longer supports checkpoint/restart
>> after the 1.6 series. Pending a new supporter coming along
> 
> As far as I can tell, lack of PSM checkpointing isn't specific to OMPI,
> and I know people have resorted to verbs to get it.
> 
> Dropped CR is definitely reason not to use OMPI past 1.6.  [By the way,
> the release notes are confusing, saying that DMTCP is supported, but CR
> is dropped.]  I'd have hoped a vendor who needs to support CR would
> contribute, but I suppose changes just become proprietary if they move
> the base past 1.6 :-(.

Not necessarily


Dave Love | 30 Apr 16:45 2013

Re: [OMPI users] QLogic HCA random crash after prolonged use

Ralph Castain <rhc <at> open-mpi.org> writes:

>> Dropped CR is definitely reason not to use OMPI past 1.6.  [By the way,
>> the release notes are confusing, saying that DMTCP is supported, but CR
>> is dropped.]  I'd have hoped a vendor who needs to support CR would
>> contribute, but I suppose changes just become proprietary if they move
>> the base past 1.6 :-(.
>
> Not necessarily

It looks so from here, but I'd be glad if not.

>> For general information, what makes the CR support difficult to maintain
>> -- is it just a question of effort?
>
> Largely a lack of interest. Very few (i.e., a handful) of people
> around the world use it,

That's surprising; I've certainly felt the lack of it from using PSM.

> and it is hard to justify putting in the
> effort for that small a user group.

Perhaps it's chicken and egg with use and support.

Elken, Tom | 24 Apr 18:52 2013

Re: [OMPI users] QLogic HCA random crash after prolonged use

> > Intel has acquired the InfiniBand assets of QLogic
> > about a year ago.  These SDR HCAs are no longer supported, but should
> > still work.
[Tom] 
I guess the more important part of what I wrote is that "These SDR HCAs are no longer supported" :)

> 
> Do you mean they should work with the latest infinipath libraries
> (despite what it said or implied in the notes for last version I got
> from QLogic?) or possibly what's in RHEL?  I thought I'd actually tried
> and failed with later stuff, but may just have gone by the release notes.
[Tom] 
Some testing by an Intel group that had these QLE7140 HCAs showed me that they do _not_ work with our
recent software stack, such as IFS 7.1.1 (which includes OFED 1.5.4.1).

They were able to get them to work with the QLogic OFED+ 6.0.2 stack.  That corresponds to OFED 1.5.2 -- that
was the first OFED to include PSM.  
I am providing this info as a courtesy, but not making any guarantees that it will work.

> 
> > You can get the driver (ib_qib) and PSM library from OFED 1.5.4.1 or
> > the current release OFED 3.5.
> 
> I wonder if there's a version of the driver that's known to work in a
> current RHEL5 system with QLE7140.  We get frequent qib-related kernel
> panics from a vanilla RHEL5.9 kernel
[Tom] 
The older QLogic and OFED stacks mentioned above were neither ported to nor tested with RHEL 5.9, which did not
exist at the time.  Sorry.


Dave Love | 25 Apr 18:24 2013

Re: [OMPI users] QLogic HCA random crash after prolonged use

"Elken, Tom" <tom.elken <at> intel.com> writes:

>> > Intel has acquired the InfiniBand assets of QLogic
>> > about a year ago.  These SDR HCAs are no longer supported, but should
>> > still work.
> [Tom] 
> I guess the more important part of what I wrote is that " These SDR HCAs are no longer supported" :)

Sure, though from our point of view, they never were.  Good riddance to
that cluster vendor, who should have gone out of business earlier.

> [Tom] 
> Some testing from an Intel group who had these QLE7140 HCAs revealed to me that they do _not_ work with our
> recent software stack such as IFS 7.1.1 (which includes OFED 1.5.4.1) .

I suspect I had done the experiment too.

> They were able to get them to work with the QLogic OFED+ 6.0.2 stack.
> That corresponds to OFED 1.5.2 -- that was the first OFED to include
> PSM.

In case it helps for anyone else trying this with old kit:  I had been
using a v6.something, but I'd have to check the something.  Using the
set of "updates" modules built with that and the latest kernel also
provokes the crashes, binary compatibility or not.

> I am providing this info as a courtesy, but not making any guarantees
> that it will work.

Understood, and thanks.