Tuesday, July 31, 2012

Replacing a disk in zpool

Replacing a disk in a zpool with a new disk. The pool before the change:



(PhyHost1:/)# zpool status zone1-pool
  pool: zone1-pool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: resilver completed after 0h4m with 0 errors on Mon Apr 16 21:02:41 2012
config:

        NAME                        STATE     READ WRITE CKSUM
        zone1-pool                  ONLINE       0     0     0
          c2t60050974071C599Cd84s2  ONLINE       0     0     0  4.80G resilvered

errors: No known data errors
(PhyHost1:/)#


The new disk. Confirm that it is visible to the host and not already in use by ZFS or VxVM:

(PhyHost1:/)# ls -l /dev/rdsk/c2t60050974071C599Cd165s2
lrwxrwxrwx   1 root     root          76 Apr 18 16:11 /dev/rdsk/c2t60050974071C599Cd165s2 -> ../../devices/pci@fc,600000/SUNW,qlc@1/fp@0,0/ssd@w50000974081b599c,a5:c,raw
(PhyHost1:/)# /etc/vx/diag.d/vxdmpinq /dev/rdsk/c2t60050974071C599Cd165s2

Inquiry for /dev/rdsk/c2t60050974071C599Cd165s2, evpd 0x0, page code 0x0
        Vendor id                        : EMC
        Product id                       : SYMMETRIX
        Revision                         : 5874
        Serial Number                    : 50999000@
(PhyHost1:/)#
(PhyHost1:/)# zpool status | grep c2t60050974071C599Cd165
(PhyHost1:/)#
(PhyHost1:/)# vxdisk -o alldgs -e list | grep c2t60050974071C599Cd165
(PhyHost1:/)# vxdisk -o alldgs -e list | grep c2t60050974071C599Cd165s2
(PhyHost1:/)# vxdisk -o alldgs -e list | grep c5t50000974081B59A0d165s2
(PhyHost1:/)#
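The empty grep results above are what show the new LUN is free. A minimal sketch of the same check as a reusable helper follows. The function name device_listed is hypothetical, and it assumes a grep that accepts -F (on Solaris 10 that is probably /usr/xpg4/bin/grep rather than /usr/bin/grep).

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the device name given as $1 appears anywhere
# in the text on standard input (for example the output of 'zpool status'),
# exit non-zero otherwise. -F treats the name as a fixed string, not a regex.
device_listed() {
    grep -F -- "$1" >/dev/null
}

# Intended use on the host (not executed here):
#   zpool status | device_listed c2t60050974071C599Cd165 \
#       && echo "already part of a zpool"
#   vxdisk -o alldgs -e list | device_listed c2t60050974071C599Cd165 \
#       && echo "already under VxVM control"
```

Reading from standard input keeps the helper independent of which command produced the listing, so the same function covers both the ZFS and the VxVM check.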


(PhyHost1:/)# zpool replace zone1-pool c2t60050974071C599Cd84s2 c2t60050974071C599Cd165s2

(PhyHost1:/)#




(PhyHost1:/)# zpool status zone1-pool
  pool: zone1-pool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h2m, 57.92% done, 0h1m to go
config:

        NAME                           STATE     READ WRITE CKSUM
        zone1-pool                     ONLINE       0     0     0
          replacing-0                  ONLINE       0     0     0
            c2t60050974071C599Cd84s2   ONLINE       0     0     0
            c2t60050974071C599Cd165s2  ONLINE       0     0     0  2.78G resilvered

errors: No known data errors
(PhyHost1:/)#
(PhyHost1:/)# zpool status zone1-pool
  pool: zone1-pool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: resilver completed after 0h4m with 0 errors on Wed Apr 18 16:46:19 2012
config:

        NAME                         STATE     READ WRITE CKSUM
        zone1-pool                   ONLINE       0     0     0
          c2t60050974071C599Cd165s2  ONLINE       0     0     0  4.80G resilvered

errors: No known data errors
(PhyHost1:/)#
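Instead of re-running zpool status by hand, the wait can be scripted. The sketch below is an assumption-laden example, not part of the original procedure: the function names are hypothetical, and the patterns match only the "resilver in progress" and "resilver completed" wording shown in the transcripts above. Other ZFS releases word this line differently, so the patterns may need adjusting.

```shell
#!/bin/sh
# Classify 'zpool status' text on stdin by its resilver line.
# Prints one of: in-progress, completed, unknown.
resilver_state() {
    awk '
        /resilver in progress/ { state = "in-progress" }
        /resilver completed/   { state = "completed" }
        END { if (state == "") state = "unknown"; print state }
    '
}

# Hypothetical wrapper: poll the pool until the resilver is no longer
# in progress, then print the final state.
# Usage: wait_for_resilver zone1-pool [seconds-between-polls]
wait_for_resilver() {
    pool=$1
    interval=${2:-30}
    while [ "$(zpool status "$pool" | resilver_state)" = "in-progress" ]; do
        sleep "$interval"
    done
    zpool status "$pool" | resilver_state
}
```

Only wait_for_resilver touches the pool, and it only reads its status. The old disk should not be released until the state is reported as completed.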

Once the resilver completes, the old disk is automatically detached from the zpool; only the new disk remains in the configuration.

(PhyHost1:/)#
(PhyHost1:/)# zpool status zone1-pool
  pool: zone1-pool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: resilver completed after 0h4m with 0 errors on Wed Apr 18 16:46:19 2012
config:

        NAME                         STATE     READ WRITE CKSUM
        zone1-pool                   ONLINE       0     0     0
          c2t60050974071C599Cd165s2  ONLINE       0     0     0  4.80G resilvered

errors: No known data errors
(PhyHost1:/)#
(PhyHost1:/)#


The old disk can now be released from the system and reclaimed.

Tuesday, July 10, 2012

Understanding why netstat -a shows connections in the BOUND state



Why does netstat -a show connections in the BOUND state?
In this case, the port is in the BOUND state and is not accepting connections:

(SlarisHost1:/)# netstat -a | grep 34749
SlarisHost1.46237          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.46257          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.46274          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.46282          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
      *.34749              *.*                0      0 49152      0 BOUND
SlarisHost1.35258          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.33680          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.44257          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.36826          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.33395          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
(SlarisHost1:/)#
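On a busy host the BOUND line is easy to miss among the ESTABLISHED ones. A small sketch that keeps only sockets whose state column is BOUND, assuming the netstat layout shown above (local address in the first column, state in the last). The function name bound_sockets is hypothetical.

```shell
#!/bin/sh
# Read 'netstat -a' or 'netstat -an' output on stdin and print the local
# address of every socket whose last column is BOUND.
bound_sockets() {
    awk '$NF == "BOUND" { print $1 }'
}

# Intended use on the host (not executed here):
#   netstat -an | bound_sockets
```

For the output above this would print the single address *.34749.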

pfiles reveals the processes using this port:

Port 34749 is used by pid 44968
Port 34749 is used by pid 48671
....
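The exact script that produced the lines above is not shown. A minimal sketch of one way to get similar output follows, assuming pfiles output in the form shown in the Oracle example further down (a "pid:" header line per process, followed by "sockname: ... port: N" lines). The function name pids_for_port is hypothetical, and awk -v needs nawk or /usr/xpg4/bin/awk on Solaris.

```shell
#!/bin/sh
# Read pfiles output on stdin and report each process that has a socket
# bound to the port given as $1.
pids_for_port() {
    awk -v port="$1" '
        # A header line such as "856:    java_vm" starts a new process.
        $1 ~ /^[0-9]+:$/ { pid = substr($1, 1, length($1) - 1); next }
        # A socket line such as "sockname: AF_INET6 ::  port: 33330".
        $1 == "sockname:" && $(NF - 1) == "port:" && $NF == port {
            if (!(pid in seen)) {
                seen[pid] = 1
                print "Port " port " is used by pid " pid
            }
        }
    '
}

# Intended use on the host, as root (not executed here):
#   cd /proc ; pfiles * 2>/dev/null | pids_for_port 34749
```

Only sockname lines are matched, so a process that merely has a connection to that port on a remote peer is not reported.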


After restarting the application processes, the port was released and went into the LISTEN state:

(SlarisHost1:/)# netstat -a | grep 34749
SlarisHost1.46237          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.46274          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.46282          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.58151          SlarisHost1.34749          49152      0 49152      0 ESTABLISHED
SlarisHost1.34749          SlarisHost1.58151          49152      0 50498      0 ESTABLISHED
SlarisHost1.58173          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.33680          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.34749                *.*                0      0 49152      0 LISTEN
SlarisHost1.36826          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.47314          SlarisHost1.34749          49152      0 49152      0 ESTABLISHED
SlarisHost1.34749          SlarisHost1.47314          49152      0 49160      0 ESTABLISHED
SlarisHost1.47326          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.33395          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
SlarisHost1.53352          SlarisHost12.34749       49640      0 49640      0 ESTABLISHED
(SlarisHost1:/)#



From Oracle Support:

Why does netstat -a show connections in the BOUND state?

The BOUND state is the state a socket shows after it is created and the 'bind()' call is made, but none of the 'listen()', 'accept()', 'connect()' or 'close()' calls have been made.  The confusion is that it is not a TCP state, it is a socket state, but it appears in the field that 'netstat' usually uses for TCP state info.

BOUND is a transitory state and the application should be doing a 'listen()'
right after 'bind()' succeeds, then wait in 'accept()' for incoming data.

A listening server process - a client would typically 'connect()' - will 'bind()' automatically.

Sometimes this state is shown after closing or killing an application.  Either situation is likely to be a problem in the way the application was implemented, and fixing it requires changes to the application code (socket programming).

If the application holds this socket open, it will prevent any other application from binding to the same TCP port number.  This can cause services to hang.  Having a BOUND state for a long period of time may cause the application to appear hung or be unresponsive.

Depending how the application works, a server could make multiple attempts to contact unavailable clients and this could result in many sockets left in a BOUND state, eventually resulting in exhausting the supply of available sockets. 
Technical Instruction Document: 1005979.1 discusses increasing the number of file descriptors to avoid the above problem.

If problems rebinding to ports are reported periodically, either when killing a daemon, or if the daemon closes a bound socket, and then creates a new one, the new socket cannot be rebound.  The process reports the following error:

    bind: Address already in use

Eventually, by killing daemons, the BOUND state goes away (kill -15, kill -11, kill -9).  However, the socket would be still bound with no process running at that time.

There is no way to free the bound ports unless the processes that have bound the socket are killed and that does not always work.  A reboot is sometimes in order.

Notice that when the application opens a socket connection it has complete  control of the socket until it releases it and that socket connection shows TIME_WAIT in the 'netstat -an' output.

To try to identify the process, use the 'pfiles' command (Solaris[TM] 8 and above). Prior to Solaris[TM] 8, the 'lsof' public domain application may be used on the system.

A possible workaround while troubleshooting is to define another socket in /etc/services.  For example:

    service-name1        5010/tcp
    service-name2        5011/tcp

When appropriate, 'truss' the application while issuing the 'kill' or 'kill -9' commands to get more info as to why it is not closing correctly.

Example of a successful attempt of identification and solution:

  % netstat -an | grep BOUND
        *.33330              *.*                0      0 24576      0 BOUND
        *.33330              *.*                0      0 24576      0 BOUND 
          
  % 
  % su    ( type root password )
  # cd /proc ; pfiles * | egrep "^[0-9]|sockname" > /var/tmp/pfiles1.txt
  # vi /var/tmp/pfiles1.txt

 

  814:    /bin/sh -c dtfile -noview
  815:    dtfile -noview
  819:    cachefsd
  856:    java_vm
          sockname: AF_UNIX
          sockname: AF_UNIX
          sockname: AF_UNIX
          sockname: AF_UNIX
          sockname: AF_UNIX
          sockname: AF_UNIX
          sockname: AF_INET6 ::  port: 33330

  # ps -ef | grep 856
      demo   856   811  0   Jan 25 pts/12   2:46 java_vm 
      root  6619  6400  0 13:58:10 pts/12   0:00 grep 856
  #
  # kill -15 856
  # 
  # ps -ef | grep 856
      root  6630  6400  0 13:59:13 pts/12   0:00 grep 856
  #
  # netstat -an | grep BOUND
  #