Erik Hugne | 26 Mar 17:27 2012

Fragmented Name publications lost in receiver

In TIPC 1.6.4, when the links are re-established after a node reboot, the
name table updates may be sent as fragmented messages (if the published
names do not fit in a single message smaller than the MTU).
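A minimal sketch of the sender-side size check. It assumes, hypothetically, that each publication item is 20 bytes (type, lower, upper, ref and key as five 32-bit words) and that the message header is 40 bytes; both values should be checked against the TIPC version in use. In this sketch all publications are packed into one NAME_DISTRIBUTOR message, and if that exceeds the link MTU the link layer splits it into fragments, each carrying its own header plus the next slice of the original message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes for illustration only. */
#define ITEM_SIZE 20u  /* one publication: type, lower, upper, ref, key */
#define HDR_SIZE  40u  /* internal message header */

/* Size of one name-distributor message carrying n_publ publications. */
static size_t distr_msg_size(size_t n_publ)
{
    return HDR_SIZE + n_publ * ITEM_SIZE;
}

/* Number of link-level packets needed for a message of msg_size bytes.
 * 1 means the message fits and is sent unfragmented. Otherwise each
 * fragment carries its own header plus (mtu - HDR_SIZE) bytes of the
 * original message, original header included. */
static size_t fragments_needed(size_t msg_size, size_t mtu)
{
    assert(mtu > HDR_SIZE);
    if (msg_size <= mtu)
        return 1;
    size_t chunk = mtu - HDR_SIZE;
    return (msg_size + chunk - 1) / chunk;  /* ceiling division */
}
```

A small publication list stays in one packet; only large lists, as after a reboot with many bound names, cross into the fragmented path.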

These fragmented NAME_DISTRIBUTOR messages seem to be lost somewhere in
the receiver.
This was found on a cluster running kernel 2.6.32.

Has anyone ever experienced problems with these lost name table updates?

I have not been able to reproduce this on a 3.2 kernel.
When I tried, I found that the name table update behavior was changed
in the two commits below so that updates are never sent as fragments:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=149ce37c8de72c64fc4f66c1b4cf7a0fb66b7ee9
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=b4b5610223f17790419b03eaa962b0e3ecf930d7
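Given the description above (updates are never sent as fragments), the newer code presumably splits the publication list across several messages that each fit within the MTU. A minimal sketch of that batching, again assuming the hypothetical 20-byte item and 40-byte header sizes; `items_per_msg` and `plan_batches` are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

#define ITEM_SIZE 20u  /* hypothetical publication item size */
#define HDR_SIZE  40u  /* hypothetical message header size */

/* How many publications fit in one unfragmented message. */
static size_t items_per_msg(size_t mtu)
{
    assert(mtu >= HDR_SIZE + ITEM_SIZE);
    return (mtu - HDR_SIZE) / ITEM_SIZE;
}

/* Split n_publ publications into MTU-sized batches. Returns the total
 * number of messages needed; batch_len[i] receives the item count of
 * message i for the first max_msgs messages. */
static size_t plan_batches(size_t n_publ, size_t mtu,
                           size_t *batch_len, size_t max_msgs)
{
    size_t per_msg = items_per_msg(mtu);
    size_t n_msgs = 0;

    while (n_publ > 0) {
        size_t take = n_publ < per_msg ? n_publ : per_msg;
        if (n_msgs < max_msgs)
            batch_len[n_msgs] = take;
        n_msgs++;
        n_publ -= take;
    }
    return n_msgs;
}
```

The trade-off is more, smaller messages in exchange for never depending on receive-side reassembly for name table updates.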

//E

Suryanarayana Garlapati | 27 Mar 09:36 2012

Re: Fragmented Name publications lost in receiver

Hi Erik,
I encountered a similar issue with kernel 2.6.27, but with TIPC 1.7
and multiple bearers. In that case the broadcast link was getting
congested and the sender stopped sending the remaining publication
messages.

I proposed a fix for this, and Al provided a patch as well. See the
following for more information:

http://tipc.git.sourceforge.net/git/gitweb.cgi?p=tipc/tipc;a=commitdiff;h=db556b4d16145cd2f04007236aeb02efcbfc6257

Regards
Surya


Erik Hugne | 28 Mar 15:26 2012

Re: Fragmented Name publications lost in receiver

There is no congestion on either side, and I think this is a different
issue, which may or may not still exist when large fragmented buffers
are transmitted.

Pcap: http://www.sendspace.com/file/ryi49l
1.1.9 is rebooted. When the links are established, 1.1.5 sends 2036
publications to 1.1.9, fragmented into 28 messages
(packets 273639 -> 273720).
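As a rough sanity check on these numbers, assume 20-byte publication items, a 40-byte message header, and 1500 - 40 = 1460 bytes of the original message carried per fragment (assumed values, not taken from the pcap). The fragment count then works out to exactly 28:

```latex
\[
  S = 40 + 2036 \times 20 = 40\,760 \text{ bytes}, \qquad
  N_{\text{frag}} = \left\lceil \frac{S}{1500 - 40} \right\rceil
                  = \left\lceil \frac{40\,760}{1460} \right\rceil
                  = \lceil 27.92 \rceil = 28
\]
```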

After this, a retransmission sequence 1.1.5 => 1.1.9 follows.

Link statistics on 1.1.9 show that 28 fragments were received, but none
were reassembled (28/0):

Link <1.1.9:bond0-1.1.5:bond0>
   ACTIVE  MTU:1500  Priority:10  Tolerance:1500 ms  Window:50 packets
   RX packets:136 fragments:28/0 bundles:0/0
   TX packets:353 fragments:0/0 bundles:0/0
   TX profile sample:22 packets  average:182 octets
   0-64:45% -256:36% -1024:14% -4096:5% -16354:0% -32768:0% -66000:0%
   RX states:355 probes:169 naks:0 defs:15 dups:1
   TX states:350 probes:175 naks:9 acks:0 dups:0
   Congestion bearer:0 link:0  Send queue max:16 avg:0
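The 28/0 reading (fragments received / messages reassembled) can be illustrated with a toy receive-side reassembler. The structure and logic are hypothetical and much simpler than the kernel's: every fragment is counted, but the reassembled counter only moves when a complete message is rebuilt. If fragments never find an open reassembly buffer, for example because the first fragment was not matched, the counters diverge as X/0. This shows one possible mechanism, not a diagnosis of the bug in the pcap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified reassembly state for one link. */
struct reasm_state {
    bool     pending;        /* a reassembly buffer is open */
    unsigned msg_no;         /* id of the message being rebuilt */
    size_t   expected;       /* total bytes, learned from the first fragment */
    size_t   received;       /* bytes collected so far */
    unsigned rx_fragments;   /* first number in "fragments:X/Y" */
    unsigned rx_reassembled; /* second number in "fragments:X/Y" */
};

/* Feed one fragment of message msg_no carrying len bytes. `total` is
 * only meaningful on the first fragment. Returns true when a whole
 * message has been rebuilt (and would be delivered upward). */
static bool reasm_rx(struct reasm_state *s, unsigned msg_no, bool first,
                     size_t total, size_t len)
{
    assert(s != NULL);
    s->rx_fragments++;                  /* every fragment is counted */

    if (first) {                        /* first fragment opens a buffer */
        s->pending = true;
        s->msg_no = msg_no;
        s->expected = total;
        s->received = 0;
    } else if (!s->pending || s->msg_no != msg_no) {
        return false;                   /* no matching buffer: dropped */
    }

    s->received += len;
    if (s->received < s->expected)
        return false;

    s->pending = false;                 /* message complete */
    s->rx_reassembled++;
    return true;
}
```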

Since the publications were never passed to the topology server on 1.1.9,
subsequent withdrawal messages from 1.1.5 then generate errors on 1.1.9,
e.g.:
TIPC: Unable to remove publication by node 0x1001005
(type=1115241, lower=835, ref=2035813093, key=2035813094)
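The error above follows naturally once the publications are lost: a withdrawal can only remove a publication the receiver already holds. The `tipc_addr` helper uses the standard TIPC <Z.C.N> address layout, which confirms that node 0x1001005 is 1.1.5. The table and lookup are a hypothetical toy, not the kernel's name table; the log format merely mirrors the line above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* TIPC network address <Z.C.N>: zone in bits 24-31, cluster in bits
 * 12-23, node in bits 0-11, so <1.1.5> is 0x1001005. */
static uint32_t tipc_addr(uint32_t zone, uint32_t cluster, uint32_t node)
{
    assert(zone <= 0xff && cluster <= 0xfff && node <= 0xfff);
    return (zone << 24) | (cluster << 12) | node;
}

/* Hypothetical, simplified publication record (not the kernel's). */
struct publ {
    bool     in_use;
    uint32_t node, type, lower, ref, key;
};

#define MAX_PUBL 4096u
static struct publ table[MAX_PUBL];

/* Record a publication received from a peer; false if the table is full. */
static bool publ_publish(uint32_t node, uint32_t type, uint32_t lower,
                         uint32_t ref, uint32_t key)
{
    for (size_t i = 0; i < MAX_PUBL; i++) {
        if (!table[i].in_use) {
            table[i] = (struct publ){ true, node, type, lower, ref, key };
            return true;
        }
    }
    return false;
}

/* Apply a withdrawal. If the matching publication was never recorded
 * (e.g. its NAME_DISTRIBUTOR message was lost), there is nothing to
 * remove and the only option is to log it. */
static bool publ_withdraw(uint32_t node, uint32_t type, uint32_t lower,
                          uint32_t ref, uint32_t key)
{
    for (size_t i = 0; i < MAX_PUBL; i++) {
        struct publ *p = &table[i];
        if (p->in_use && p->node == node && p->type == type &&
            p->lower == lower && p->ref == ref && p->key == key) {
            p->in_use = false;
            return true;
        }
    }
    fprintf(stderr, "Unable to remove publication by node 0x%" PRIx32
            " (type=%" PRIu32 ", lower=%" PRIu32 ", ref=%" PRIu32
            ", key=%" PRIu32 ")\n", node, type, lower, ref, key);
    return false;
}
```

Withdrawing the same publication twice, or withdrawing one that never arrived, takes the logging path.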
