Edmund Sumbar | 31 May 21:43 2012

[OMPI users] seg fault with intel compiler

Hi,

I feel like a dope. I can't seem to successfully run the following simple test program (from the Intel MPI distro) as a Torque batch job on a CentOS 5.7 cluster with Open MPI 1.6 compiled using Intel compilers 12.1.0.233.

If I comment out MPI_Get_processor_name(), it works.

#include "mpi.h"
#include <stdio.h>
#include <string.h>

int
main (int argc, char *argv[])
{
    int i, rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Status stat;

    MPI_Init (&argc, &argv);

    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name (name, &namelen);

    if (rank == 0) {

        printf ("Hello world: rank %d of %d running on %s\n", rank, size, name);

        for (i = 1; i < size; i++) {
            MPI_Recv (&rank, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
            MPI_Recv (&size, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
            MPI_Recv (&namelen, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
            MPI_Recv (name, namelen + 1, MPI_CHAR, i, 1, MPI_COMM_WORLD, &stat);
            printf ("Hello world: rank %d of %d running on %s\n", rank, size, name);
        }

    } else {

        MPI_Send (&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        MPI_Send (&size, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        MPI_Send (&namelen, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        MPI_Send (name, namelen + 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);

    }

    MPI_Finalize ();

    return (0);
}

The result I get is

[cl2n007:19441] *** Process received signal ***
[cl2n007:19441] Signal: Segmentation fault (11)
[cl2n007:19441] Signal code: Address not mapped (1)
[cl2n007:19441] Failing at address: 0x10
[cl2n007:19441] [ 0] /lib64/libpthread.so.0 [0x306980ebe0]
[cl2n007:19441] [ 1] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2af078563113]
[cl2n007:19441] [ 2] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59) [0x2af0785658a9]
[cl2n007:19441] [ 3] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 [0x2af078565596]
[cl2n007:19441] [ 4] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_class_initialize+0xaa) [0x2af078582faa]
[cl2n007:19441] [ 5] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so [0x2af07c3e1909]
[cl2n007:19441] [ 6] /lib64/libpthread.so.0 [0x306980677d]
[cl2n007:19441] [ 7] /lib64/libc.so.6(clone+0x6d) [0x3068cd325d]
[cl2n007:19441] *** End of error message ***
[cl2n006:11146] [[51262,0],8] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n006:11146] [[51262,0],8] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n006:11146] [[51262,0],8] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line[cl2n007:19434] [[51262,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
 2342
[cl2n007:19434] [[51262,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n007:19434] [[51262,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n005:13582] [[51262,0],9] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n005:13582] [[51262,0],9] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n005:13582] [[51262,0],9] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342

...more of the same...


$ ompi_info
                 Package: Open MPI root <at> jasper.westgrid.ca Distribution
                Open MPI: 1.6
   Open MPI SVN revision: r26429
   Open MPI release date: May 10, 2012
                Open RTE: 1.6
   Open RTE SVN revision: r26429
   Open RTE release date: May 10, 2012
                    OPAL: 1.6
       OPAL SVN revision: r26429
       OPAL release date: May 10, 2012
                 MPI API: 2.1
            Ident string: 1.6
                  Prefix: /lustre/jasper/software/openmpi/openmpi-1.6-intel
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: jasper.westgrid.ca
           Configured by: root
           Configured on: Wed May 30 13:56:39 MDT 2012
          Configure host: jasper.westgrid.ca
                Built by: root
                Built on: Wed May 30 14:35:10 MDT 2012
              Built host: jasper.westgrid.ca
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: icc
     C compiler absolute: /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/icc
  C compiler family name: INTEL
      C compiler version: 9999.20110811
            C++ compiler: icpc
   C++ compiler absolute: /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/icpc
      Fortran77 compiler: ifort
  Fortran77 compiler abs: /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
      Fortran90 compiler: ifort
  Fortran90 compiler abs: /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: no, progress: no)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
         MPI I/O support: yes
       MPI_WTIME support: gettimeofday
     Symbol vis. support: yes
   Host topology support: yes
          MPI extensions: affinity example
   FT Checkpoint support: no (checkpoint thread: no)
     VampirTrace support: yes
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.6)
              MCA memory: linux (MCA v2.0, API v2.0, Component v1.6)
           MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6)
               MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.6)
               MCA carto: file (MCA v2.0, API v2.0, Component v1.6)
               MCA shmem: mmap (MCA v2.0, API v2.0, Component v1.6)
               MCA shmem: posix (MCA v2.0, API v2.0, Component v1.6)
               MCA shmem: sysv (MCA v2.0, API v2.0, Component v1.6)
           MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.6)
           MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.6)
               MCA timer: linux (MCA v2.0, API v2.0, Component v1.6)
         MCA installdirs: env (MCA v2.0, API v2.0, Component v1.6)
         MCA installdirs: config (MCA v2.0, API v2.0, Component v1.6)
             MCA sysinfo: linux (MCA v2.0, API v2.0, Component v1.6)
               MCA hwloc: hwloc132 (MCA v2.0, API v2.0, Component v1.6)
                 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.6)
              MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.6)
           MCA allocator: basic (MCA v2.0, API v2.0, Component v1.6)
           MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.6)
                MCA coll: basic (MCA v2.0, API v2.0, Component v1.6)
                MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.6)
                MCA coll: inter (MCA v2.0, API v2.0, Component v1.6)
                MCA coll: self (MCA v2.0, API v2.0, Component v1.6)
                MCA coll: sm (MCA v2.0, API v2.0, Component v1.6)
                MCA coll: sync (MCA v2.0, API v2.0, Component v1.6)
                MCA coll: tuned (MCA v2.0, API v2.0, Component v1.6)
                  MCA io: romio (MCA v2.0, API v2.0, Component v1.6)
               MCA mpool: fake (MCA v2.0, API v2.0, Component v1.6)
               MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.6)
               MCA mpool: sm (MCA v2.0, API v2.0, Component v1.6)
                 MCA pml: bfo (MCA v2.0, API v2.0, Component v1.6)
                 MCA pml: csum (MCA v2.0, API v2.0, Component v1.6)
                 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.6)
                 MCA pml: v (MCA v2.0, API v2.0, Component v1.6)
                 MCA bml: r2 (MCA v2.0, API v2.0, Component v1.6)
              MCA rcache: vma (MCA v2.0, API v2.0, Component v1.6)
                 MCA btl: ofud (MCA v2.0, API v2.0, Component v1.6)
                 MCA btl: openib (MCA v2.0, API v2.0, Component v1.6)
                 MCA btl: self (MCA v2.0, API v2.0, Component v1.6)
                 MCA btl: sm (MCA v2.0, API v2.0, Component v1.6)
                 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6)
                MCA topo: unity (MCA v2.0, API v2.0, Component v1.6)
                 MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.6)
                 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.6)
                 MCA iof: hnp (MCA v2.0, API v2.0, Component v1.6)
                 MCA iof: orted (MCA v2.0, API v2.0, Component v1.6)
                 MCA iof: tool (MCA v2.0, API v2.0, Component v1.6)
                 MCA oob: tcp (MCA v2.0, API v2.0, Component v1.6)
                MCA odls: default (MCA v2.0, API v2.0, Component v1.6)
                 MCA ras: cm (MCA v2.0, API v2.0, Component v1.6)
                 MCA ras: loadleveler (MCA v2.0, API v2.0, Component v1.6)
                 MCA ras: slurm (MCA v2.0, API v2.0, Component v1.6)
                 MCA ras: tm (MCA v2.0, API v2.0, Component v1.6)
               MCA rmaps: load_balance (MCA v2.0, API v2.0, Component v1.6)
               MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.6)
               MCA rmaps: resilient (MCA v2.0, API v2.0, Component v1.6)
               MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.6)
               MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.6)
               MCA rmaps: topo (MCA v2.0, API v2.0, Component v1.6)
                 MCA rml: oob (MCA v2.0, API v2.0, Component v1.6)
              MCA routed: binomial (MCA v2.0, API v2.0, Component v1.6)
              MCA routed: cm (MCA v2.0, API v2.0, Component v1.6)
              MCA routed: direct (MCA v2.0, API v2.0, Component v1.6)
              MCA routed: linear (MCA v2.0, API v2.0, Component v1.6)
              MCA routed: radix (MCA v2.0, API v2.0, Component v1.6)
              MCA routed: slave (MCA v2.0, API v2.0, Component v1.6)
                 MCA plm: rsh (MCA v2.0, API v2.0, Component v1.6)
                 MCA plm: slurm (MCA v2.0, API v2.0, Component v1.6)
                 MCA plm: tm (MCA v2.0, API v2.0, Component v1.6)
               MCA filem: rsh (MCA v2.0, API v2.0, Component v1.6)
              MCA errmgr: default (MCA v2.0, API v2.0, Component v1.6)
                 MCA ess: env (MCA v2.0, API v2.0, Component v1.6)
                 MCA ess: hnp (MCA v2.0, API v2.0, Component v1.6)
                 MCA ess: singleton (MCA v2.0, API v2.0, Component v1.6)
                 MCA ess: slave (MCA v2.0, API v2.0, Component v1.6)
                 MCA ess: slurm (MCA v2.0, API v2.0, Component v1.6)
                 MCA ess: slurmd (MCA v2.0, API v2.0, Component v1.6)
                 MCA ess: tm (MCA v2.0, API v2.0, Component v1.6)
                 MCA ess: tool (MCA v2.0, API v2.0, Component v1.6)
             MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.6)
             MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.6)
             MCA grpcomm: hier (MCA v2.0, API v2.0, Component v1.6)
            MCA notifier: command (MCA v2.0, API v1.0, Component v1.6)
            MCA notifier: syslog (MCA v2.0, API v1.0, Component v1.6)


--
Edmund Sumbar
University of Alberta
+1 780 492 9360

_______________________________________________
users mailing list
users <at> open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Jeff Squyres | 31 May 22:54 2012

Re: [OMPI users] seg fault with intel compiler

This type of error usually means that you are inadvertently mixing versions of Open MPI (e.g., version
A.B.C on one node and D.E.F on another node).

Ensure that your paths are set up consistently and that you're getting both the same OMPI tools in your $PATH
and the same libmpi.so in your $LD_LIBRARY_PATH.
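
A quick sanity check might look something like this (untested sketch; the node names are just examples taken from your error output):

# Interactive shell on the node where you submit/build
echo $PATH
echo $LD_LIBRARY_PATH

# Non-interactive shells on a couple of compute nodes -- this is what your
# job actually gets, and it can pick up a different environment
ssh cl2n006 'echo $PATH; echo $LD_LIBRARY_PATH'
ssh cl2n007 'echo $PATH; echo $LD_LIBRARY_PATH'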

On May 31, 2012, at 3:43 PM, Edmund Sumbar wrote:

> Hi,
> 
> I feel like a dope. I can't seem to successfully run the following simple test program (from the Intel MPI distro) as a Torque batch job on a CentOS 5.7 cluster with Open MPI 1.6 compiled using Intel compilers 12.1.0.233.
> 
> If I comment out MPI_Get_processor_name(), it works.

-- 
Jeff Squyres
jsquyres <at> cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Edmund Sumbar | 1 Jun 02:21 2012

Re: [OMPI users] seg fault with intel compiler

Thanks for the tip Jeff,

I wish it were that simple. Unfortunately, this is the only version installed. When I added --prefix to the mpiexec command line, I still got a seg fault, but without the backtrace. Oh well, I'll keep trying (compiler upgrade, etc.).

[cl2n022:03057] *** Process received signal ***
[cl2n022:03057] Signal: Segmentation fault (11)
[cl2n022:03057] Signal code: Address not mapped (1)
[cl2n022:03057] Failing at address: 0x10
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n022:03048] [[45689,0],7] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n010:16470] *** Process received signal ***
[cl2n010:16470] Signal: Segmentation fault (11)
[cl2n010:16470] Signal code: Address not mapped (1)
[cl2n010:16470] Failing at address: 0x10
--------------------------------------------------------------------------
mpiexec noticed that process rank 32 with PID 3057 on node cl2n022 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


On Thu, May 31, 2012 at 2:54 PM, Jeff Squyres <jsquyres <at> cisco.com> wrote:
This type of error usually means that you are inadvertently mixing versions of Open MPI (e.g., version A.B.C on one node and D.E.F on another node).



--
Edmund Sumbar
University of Alberta
+1 780 492 9360

Jeff Squyres | 1 Jun 13:00 2012

Re: [OMPI users] seg fault with intel compiler

Try running:

which mpirun
ssh cl2n022 which mpirun
ssh cl2n010 which mpirun

and

ldd your_mpi_executable
ssh cl2n022 ldd your_mpi_executable
ssh cl2n010 ldd your_mpi_executable

Compare the results and ensure that you're finding the same mpirun on all nodes, and the same libmpi.so on
all nodes.  There may well be another Open MPI installed in some non-default location of which you're unaware.
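
If the allocation is big, a rough (untested) sketch like the following, run from inside the job so that $PBS_NODEFILE is set, checks every allocated node instead of just two; it assumes your working directory is on a shared filesystem and that "your_mpi_executable" is the binary you launch:

for node in $(sort -u "$PBS_NODEFILE"); do
    echo "=== $node ==="
    # $PWD expands locally; the same path must be visible on the remote node
    ssh "$node" "which mpirun; ldd $PWD/your_mpi_executable | grep libmpi"
done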

On May 31, 2012, at 8:21 PM, Edmund Sumbar wrote:

> Thanks for the tip Jeff,
> 
> I wish it were that simple. Unfortunately, this is the only version installed. When I added --prefix to the mpiexec command line, I still got a seg fault, but without the backtrace. Oh well, I'll keep trying (compiler upgrade, etc.).

-- 
Jeff Squyres
jsquyres <at> cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Edmund Sumbar | 1 Jun 16:03 2012

Re: [OMPI users] seg fault with intel compiler

On Fri, Jun 1, 2012 at 5:00 AM, Jeff Squyres <jsquyres <at> cisco.com> wrote:
Try running:

which mpirun
ssh cl2n022 which mpirun
ssh cl2n010 which mpirun

and

ldd your_mpi_executable
ssh cl2n022 ldd your_mpi_executable
ssh cl2n010 ldd your_mpi_executable

Compare the results and ensure that you're finding the same mpirun on all nodes, and the same libmpi.so on all nodes.  There may well be another Open MPI installed in some non-default location of which you're unaware.

I'll try that, Jeff (results given below). However, I suspect there is something goofy about this (brand new) cluster itself, because among the countless jobs that failed, I got one job that ran without error, and all I did was rearrange the echo and which commands. We've also observed some peculiar behaviour on this cluster using Intel MPI that seemed to be associated with the number of tasks requested. And after more experimentation, the Open MPI version of the program also seems to be sensitive to the number of tasks (e.g., it works with 48 but fails with 64).

Thanks for the feedback Jeff, but I think the ball is firmly in my court.



I ran the following PBS script with "qsub -l procs=128 job.pbs". Environment variables are set using the Environment Modules package.

echo $HOSTNAME
which mpiexec
module load library/openmpi/1.6-intel

which mpiexec
echo $PATH
echo $LD_LIBRARY_PATH
ldd test-ompi16
mpiexec --prefix /lustre/jasper/software/openmpi/openmpi-1.6-intel ./test-ompi16

Standard output gave

cl2n011

/lustre/jasper/software/openmpi/openmpi-1.6-intel/bin/mpiexec

/lustre/jasper/software/openmpi/openmpi-1.6-intel/bin:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/bin/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64:/home/esumbar/local/bin:/home/esumbar/bin:/usr/kerberos/bin:/bin:/usr/bin:/opt/sgi/sgimc/bin:/usr/local/torque/sbin:/usr/local/torque/bin

/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

    linux-vdso.so.1 =>  (0x00007fffb5358000)
    libmpi.so.1 => /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 (0x00002b3968d1d000)
    libdl.so.2 => /lib64/libdl.so.2 (0x000000329ce00000)
    libimf.so => /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libimf.so (0x00002b3969137000)
    libm.so.6 => /lib64/libm.so.6 (0x000000329d200000)
    librt.so.1 => /lib64/librt.so.1 (0x000000329da00000)
    libnsl.so.1 => /lib64/libnsl.so.1 (0x00000032a6400000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00000032a8400000)
    libsvml.so => /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libsvml.so (0x00002b3969504000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000032a4c00000)
    libintlc.so.5 => /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libintlc.so.5 (0x00002b3969c77000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x000000329d600000)
    libc.so.6 => /lib64/libc.so.6 (0x000000329ca00000)
    /lib64/ld-linux-x86-64.so.2 (0x000000329c200000)


Standard error gave

which: no mpiexec in (/home/esumbar/local/bin:/home/esumbar/bin:/usr/kerberos/bin:/bin:/usr/bin:/opt/sgi/sgimc/bin:/usr/local/torque/sbin:/usr/local/torque/bin)

[cl2n005:05142] *** Process received signal ***
[cl2n005:05142] Signal: Segmentation fault (11)
[cl2n005:05142] Signal code: Address not mapped (1)
[cl2n005:05142] Failing at address: 0x10
[cl2n005:05142] [ 0] /lib64/libpthread.so.0 [0x373180ebe0]
[cl2n005:05142] [ 1] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2aff9aad5113]
[cl2n005:05142] [ 2] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59) [0x2aff9aad78a9]
[cl2n005:05142] [ 3] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 [0x2aff9aad7596]
[cl2n005:05142] [ 4] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(ompi_free_list_grow+0x89) [0x2aff9aa0fa59]
[cl2n005:05142] [ 5] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(ompi_free_list_init_ex+0x9c) [0x2aff9aa0fd8c]
[cl2n005:05142] [ 6] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so [0x2aff9e94561c]
[cl2n005:05142] [ 7] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(mca_btl_base_select+0x130) [0x2aff9aa57930]
[cl2n005:05142] [ 8] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0xe) [0x2aff9e52bc1e]
[cl2n005:05142] [ 9] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(mca_bml_base_init+0x72) [0x2aff9aa570b2]
[cl2n005:05142] [10] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_pml_ob1.so [0x2aff9e1107e9]
[cl2n005:05142] [11] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(mca_pml_base_select+0x43e) [0x2aff9aa6592e]
[cl2n005:05142] [12] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(ompi_mpi_init+0x782) [0x2aff9aa276a2]
[cl2n005:05142] [13] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(MPI_Init+0xf4) [0x2aff9aa3f884]
[cl2n005:05142] [14] ./test-ompi16(main+0x4c) [0x400b5c]
[cl2n005:05142] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3730c1d994]
[cl2n005:05142] [16] ./test-ompi16 [0x400a59]
[cl2n005:05142] *** End of error message ***
[cl2n006:32362] [[58962,0],5] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n006:32362] [[58962,0],5] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n006:32362] [[58962,0],5] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n003:04157] [[58962,0],8] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n003:04157] [[58962,0],8] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n003:04157] [[58962,0],8] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
--------------------------------------------------------------------------
mpiexec noticed that process rank 77 with PID 5142 on node cl2n005 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


--
Edmund Sumbar
University of Alberta
+1 780 492 9360

Jeff Squyres | 1 Jun 16:09 2012

Re: [OMPI users] seg fault with intel compiler

On Jun 1, 2012, at 10:03 AM, Edmund Sumbar wrote:

> I ran the following PBS script with "qsub -l procs=128 job.pbs". Environment variables are set using the Environment Modules package.
> 
> echo $HOSTNAME
> which mpiexec
> module load library/openmpi/1.6-intel

This *may* be the problem here.

It's been a loooong time since I've run under PBS, so I don't remember if your script's environment is copied
out to the remote nodes where your application actually runs.

Can you verify that PATH and LD_LIBRARY_PATH are the same on all nodes in your PBS allocation after you
module load?
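
One quick way to check, if your mpiexec will happily launch a non-MPI command (untested sketch), is to add something like this to the job script right after the module load; it prints the PATH and LD_LIBRARY_PATH that each launched process actually sees, prefixed with its host:

mpiexec -bynode sh -c 'echo "$(hostname): PATH=$PATH LD_LIBRARY_PATH=$LD_LIBRARY_PATH"'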

FWIW, since you've installed OMPI 1.6, you may want to uninstall the Open MPI that may have been installed by
your OS.
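
For example, something along these lines (rough sketch; the package names and paths are guesses for a CentOS 5 node) might turn up a stray system-installed copy:

rpm -qa | grep -i openmpi                          # anything installed from RPMs?
ls -d /usr/lib64/openmpi* /usr/mpi/* 2>/dev/null   # common system/vendor install locations
type -a mpirun mpiexec orted                       # every copy visible in the current $PATH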

-- 
Jeff Squyres
jsquyres <at> cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Edmund Sumbar | 1 Jun 17:26 2012

Re: [OMPI users] seg fault with intel compiler

On Fri, Jun 1, 2012 at 8:09 AM, Jeff Squyres <jsquyres <at> cisco.com> wrote:
It's been a loooong time since I've run under PBS, so I don't remember if your script's environment is copied out to the remote nodes where your application actually runs.

Can you verify that PATH and LD_LIBRARY_PATH are the same on all nodes in your PBS allocation after you module load?

I compiled the following program and invoked it with "mpiexec -bynode ./test-env" in a PBS script.

#include "mpi.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
  int i, rank, size, namelen;
  MPI_Status stat;

  MPI_Init (&argc, &argv);

  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  printf("rank: %d: ld_library_path: %s\n", rank, getenv("LD_LIBRARY_PATH"));

  MPI_Finalize ();

  return (0);
}

I submitted the script with "qsub -l procs=24 job.pbs", and got

rank: 4: ld_library_path: /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

rank: 3: ld_library_path: /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

...more of the same...

When I submitted it with -l procs=48, I got

[cl2n004:11617] *** Process received signal ***
[cl2n004:11617] Signal: Segmentation fault (11)
[cl2n004:11617] Signal code: Address not mapped (1)
[cl2n004:11617] Failing at address: 0x10
[cl2n004:11617] [ 0] /lib64/libpthread.so.0 [0x376ca0ebe0]
[cl2n004:11617] [ 1] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2af788a98113]
[cl2n004:11617] [ 2] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59) [0x2af788a9a8a9]
[cl2n004:11617] [ 3] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 [0x2af788a9a596]
[cl2n004:11617] [ 4] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so [0x2af78c916654]
[cl2n004:11617] [ 5] /lib64/libpthread.so.0 [0x376ca0677d]
[cl2n004:11617] [ 6] /lib64/libc.so.6(clone+0x6d) [0x376bed325d]
[cl2n004:11617] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 4 with PID 11617 on node cl2n004 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Failures seem to happen for arbitrary reasons. When I added a line to the PBS script to print out the node allocation, the procs=24 case failed, but then it worked a few seconds later with the same list of allocated nodes. So there's definitely something amiss with the cluster, although I wouldn't know where to start investigating. Perhaps there is a pre-installed OMPI somewhere that's interfering, but I'm doubtful.

By the way, thanks for all the support.

--
Edmund Sumbar
University of Alberta
+1 780 492 9360

