Saturday, August 11, 2007

ORA-19511: VERITAS NetBackup: Status 25: Cannot connect on socket

Status 25: "Cannot connect on socket"

Exact Error Message
25 cannot connect on socket

Details:
Overview:
When performing backups or restores, socket errors are being produced.

Troubleshooting:
Please follow all steps within the VERITAS NetBackup (tm) Troubleshooting Guide or the NetBackup Troubleshooter within the Activity Monitor for this status code before continuing.

Please confirm hardware and software compatibility before continuing. A list of compatible hardware and software may be obtained within the VERITAS NetBackup Release Notes or on the VERITAS Support Web site.

If the above does not resolve the issue, please continue.


1. Status code 25 caused by no open sockets on the Master/Media Server

Please look for the following log messages:

Master Log Files:

Media Server Log Files:
bpbrm:
02:44:21.877 [1092.2944] <2> nb_bind_on_port_addr: port 4998 unavailable
02:44:21.877 [1092.2944] <2> nb_getsockconnected: cannot connect() to MasterServer, Only one usage of each socket address (protocol/network address/port) is normally permitted.
02:44:21.877 [1092.2944] <2> db_begin: nb_getsockconnected() failed: Only one usage of each socket address (protocol/network address/port) is normally permitted. (10048)
02:44:21.877 [1092.2944] <2> db_FLISTsend: db_begin() failed: cannot connect on socket
02:44:21.877 [1092.2944] <16> bpbrm handle_backup: db_FLISTsend failed: cannot connect on socket (25)
02:44:21.877 [1092.2944] <2> nb_bind_on_port_addr: port 4999 unavailable
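
To check quickly whether a bpbrm log shows this pattern, the lines can be scanned for the two messages above. The following is a minimal sketch, not a VERITAS tool; the function name and the regular expressions are assumptions based only on the log excerpt shown here.

```python
import re

# Matches lines such as:
# 02:44:21.877 [1092.2944] <2> nb_bind_on_port_addr: port 4998 unavailable
PORT_UNAVAILABLE = re.compile(r"nb_bind_on_port_addr: port (\d+) unavailable")
# Matches the final failure line that carries the status code.
STATUS_25 = re.compile(r"cannot connect on socket \(25\)")


def summarize_bpbrm_log(lines):
    """Return (sorted unavailable ports, number of status 25 failures)."""
    ports = set()
    failures = 0
    for line in lines:
        match = PORT_UNAVAILABLE.search(line)
        if match:
            ports.add(int(match.group(1)))
        if STATUS_25.search(line):
            failures += 1
    return sorted(ports), failures


# Three lines taken from the excerpt above.
sample = [
    "02:44:21.877 [1092.2944] <2> nb_bind_on_port_addr: port 4998 unavailable",
    "02:44:21.877 [1092.2944] <16> bpbrm handle_backup: db_FLISTsend failed: cannot connect on socket (25)",
    "02:44:21.877 [1092.2944] <2> nb_bind_on_port_addr: port 4999 unavailable",
]
print(summarize_bpbrm_log(sample))
```

Many distinct unavailable ports around the time of a status 25 failure would be consistent with the socket shortage described in this section.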

Client Log Files:

Resolution:
A socket in the TIME_WAIT state can be made available sooner by reducing the tcp_time_wait_interval parameter. On Solaris the default is 240000 milliseconds (4 minutes). On Hewlett-Packard (HP) systems the default is 60000 milliseconds (1 minute). On AIX the parameter is set in increments of 15 seconds; the default value is 1 (15 seconds), which is also the lowest value AIX allows. VERITAS has found that tcp_time_wait_interval can often be set to 1 second with no adverse effects. However, customers should lower tcp_time_wait_interval in increments until backups no longer fail with status 25.
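
Because the platforms use different units, it can help to normalize the quoted defaults to seconds before comparing them. This is only an illustrative sketch; the table and the helper name to_seconds are assumptions, and the numbers are the defaults stated in the paragraph above.

```python
# Defaults quoted in the paragraph above, as (raw value, milliseconds per unit).
DEFAULTS = {
    "Solaris": (240000, 1),   # value is in milliseconds
    "HP-UX": (60000, 1),      # value is in milliseconds
    "AIX": (1, 15000),        # value counts 15-second increments
}


def to_seconds(raw_value, ms_per_unit):
    """Convert a platform-native setting to seconds."""
    return raw_value * ms_per_unit / 1000


for platform, (raw, ms_per_unit) in DEFAULTS.items():
    print(f"{platform}: {to_seconds(raw, ms_per_unit):g} s")
```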

The value of this parameter can be displayed and changed with the following commands:

For Solaris 2.6 or earlier:

# ndd -get /dev/tcp tcp_close_wait_interval
240000
# ndd -set /dev/tcp tcp_close_wait_interval 10000
(10000 milliseconds = 10 seconds)


For Solaris 7 or later:

# ndd -get /dev/tcp tcp_time_wait_interval
240000
# ndd -set /dev/tcp tcp_time_wait_interval 10000
(10000 milliseconds = 10 seconds)


The ndd command makes the change immediately, without a need for a reboot. This setting will go back to default after a reboot. To make the change permanent, the command can be added to the appropriate TCP/IP startup script. On Solaris, this is /etc/rc2.d/S69inet.

For HP-UX 11, use the following command:

# ndd -get /dev/tcp tcp_time_wait_interval
60000
# ndd -set /dev/tcp tcp_time_wait_interval 10000
(10000 milliseconds = 10 seconds)

Note: The equivalent command on HP-UX 10.x is nettune instead of ndd.

The ndd command makes the change immediately, without a need for a reboot. This setting will go back to default after a reboot. To make the change permanent, the command can be added to the appropriate TCP/IP startup script. On HP-UX 11, see /etc/rc.config.d/nddconf for examples on how to set it.

It is not necessary to change the tcp_timewait default parameter on AIX, but for reference:

To display the value, enter:
# no -a
or
# no -o tcp_timewait

To change the value, enter:
# no -o tcp_timewait=1

The value is a positive integer, with each increment representing 15 seconds. The default is 1 (15 seconds). The change takes effect immediately and lasts until the next boot. A permanent change is made by adding the command to /etc/rc.net.

Windows System:
Ports in the TIME_WAIT state can be seen by running netstat -a from a command prompt on the master or media server. Their presence can indicate that ports are not being released quickly enough to allow another connection on that port. The registry value TcpTimedWaitDelay can be added to reduce the time a port stays in this state.

1. Go to Start | Run, type regedit, and click OK
2. Navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
3. Highlight the Parameters key, right-click and select New | DWORD Value
4. Type the name TcpTimedWaitDelay
5. Double-click TcpTimedWaitDelay and enter a decimal value of 30

By default, this value is 240 seconds (4 minutes). It is recommended that this be changed to a decimal value between 30 and 60 seconds to decrease the wait time before a port becomes available.
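
To judge whether the change helps, the TIME_WAIT entries in saved netstat -a output can be counted before and after. The sketch below assumes the usual column layout in which the connection state is the last field of each TCP row; the function name is an assumption, and the host names and ports in the sample are made up.

```python
from collections import Counter


def count_states(netstat_output):
    """Count TCP connection states in text captured from `netstat -a`."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows look like: TCP  local_address  foreign_address  STATE
        if len(fields) >= 4 and fields[0].upper().startswith("TCP"):
            states[fields[-1]] += 1
    return states


# Hypothetical sample; real output would come from `netstat -a > file`.
sample = """
  Proto  Local Address          Foreign Address        State
  TCP    master:4998            media1:2001            TIME_WAIT
  TCP    master:4999            media1:2002            TIME_WAIT
  TCP    master:3000            media1:2003            ESTABLISHED
"""
print(count_states(sample)["TIME_WAIT"])
```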

Description:
This parameter determines the length of time that a connection stays in the TIME_WAIT state when being closed. While a connection is in the TIME_WAIT state, the socket pair cannot be reused. This is also known as the "2MSL" state because, per the RFC, the value should be twice the maximum segment lifetime on the network. See RFC 793 for further details.
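
The effect of the interval can be estimated with simple arithmetic. The sketch below is a rough model under stated assumptions, not a measurement: it assumes every connection uses one local port from a fixed pool and goes to the same remote address and port, and the pool size of 480 ports is purely hypothetical.

```python
def msl_seconds(time_wait_seconds):
    """TIME_WAIT lasts 2 * MSL, so MSL is half the TIME_WAIT interval."""
    return time_wait_seconds / 2


def max_sustained_rate(port_count, time_wait_seconds):
    """Upper bound on new connections per second before every local port
    in the pool is sitting in TIME_WAIT (simplified model)."""
    return port_count / time_wait_seconds


POOL = 480  # hypothetical number of usable local ports

print(msl_seconds(240))               # MSL implied by a 240-second interval
print(max_sustained_rate(POOL, 240))  # connections per second at 240 seconds
print(max_sustained_rate(POOL, 30))   # connections per second at 30 seconds
```

Under this model, reducing the interval from 240 to 30 seconds raises the sustainable connection rate by a factor of eight.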

IBM: TSM (ADSM) back-up - TCP/IP connection failure

J1gh2 (MIS)
11 Aug 04 7:15
Hi folks

For a few months now a level 0 backup fails about once a week with "ANS1017E (RC-50) Session rejected: TCP/IP connection failure". This error appears in the database log file but there is nothing in the dsmerror.log or dsmsched.log.

The TSM server is in a healthy state in all other respects, and all other scheduled backups complete after this. We are running TSM 5.1 on AIX 5.2. We also run TSM 4.2 on AIX 4.3, but this does not exhibit the problem.

Since it happens intermittently, it's difficult to set a trace. The workload on the server is not particularly challenging and topas shows ample idle time.

I would appreciate any help and suggestions. Thanks a lot

LED888 (TechnicalUser)
12 Aug 04 22:17

Session rejected: TCP/IP connection failure [Same as ANS1017E]
This is what the client sees and reports, but has no idea why.
The cause is best sought in the ADSM server Activity Log for that time.
Could be a real datacomm problem; or...
Grossest problem: the TSM server is down.

If you get this condition after supposedly changing the client and server to use a different port number (e.g., 1502), and the Activity Log has no significant information about the attempted session, use 'netstat' or 'lsof' or similar utility in the server operating system to verify that the *SM server is actually serving the port number that you believe it should be. (You *did* code the port numbers into both the client and server options files, right?)
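
As a complement to 'netstat' or 'lsof' on the server, the same check can be made from the client side by attempting a TCP connection to the port. This is a minimal sketch; the function name and the timeout are assumptions, and a successful connect only shows that something is listening, not that it is the *SM server.

```python
import socket


def port_is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

For example, port_is_listening("tsmserver", 1502) would test the port number mentioned above, where "tsmserver" stands in for the real server host name.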
An administrator may have done a 'CANcel SEssion'.
If it happens during a backup, the server is likely cancelling it because a higher-priority task such as a DB backup is starting and needs a tape drive, particularly when there is a drive shortage. Look in the server Activity Log around that time and you will likely see "ANR0492I All drives in use. Session 22668 for node ________ (AIX) being preempted by higher priority operation.".

Or look in the Activity Log for a "ANR0481W Session NNN for node () terminated - client did not respond within NN seconds." message, which reflects a server COMMTimeout value that is too low.

Message "ANR0482W Session for name () terminated - idle for more than N minutes." is telling you that the server IDLETimeout value is too low. Remember that longstanding clients may take considerable time to rummage around in their file systems looking for new files to back up.
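
The three Activity Log messages mentioned so far can be picked out of exported log text mechanically. The sketch below is illustrative only: the hint strings paraphrase the explanations in this post, the function name is an assumption, and the input is assumed to be plain text lines containing the message codes.

```python
import re

# Message codes discussed above, with a short paraphrase of the likely cause.
HINTS = {
    "ANR0492I": "session preempted: all drives in use",
    "ANR0481W": "client did not respond in time: check COMMTimeout",
    "ANR0482W": "session idle too long: check IDLETimeout",
}
CODE = re.compile(r"\b(ANR\d{4}[IWE])\b")


def explain(activity_log_lines):
    """Return (code, hint) pairs for the known messages found in the lines."""
    findings = []
    for line in activity_log_lines:
        match = CODE.search(line)
        if match and match.group(1) in HINTS:
            findings.append((match.group(1), HINTS[match.group(1)]))
    return findings


sample = [
    "ANR0492I All drives in use. Session 22668 for node ________ (AIX) "
    "being preempted by higher priority operation.",
]
print(explain(sample))
```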

Another problem is in starting your client scheduler process from
/etc/inittab, but failing to specify redirection - you need:
dsmc::once:/usr/bin/dsmc sched > /dev/null 2>&1 # TSM scheduler

An unusual cause is in having the client and server defined to use the same port number!

Might also be a firewall rejecting the TSM client as it tries to reach the server through that firewall.


J1gh2 (MIS)
13 Aug 04 5:03
Thanks a lot, LED888

I will follow up all the leads that you have given me and let you know.

Cheers

J1gh2 (MIS)
17 Aug 04 8:29
Hi LED888

I did indeed find "ANR0492I All drives in use..." in the activity log at the time of the TCP/IP connection failure. We are now installing more drives...

Thanks a lot for your help

LED888 (TechnicalUser)
19 Aug 04 20:37
That's great news!