Thursday, September 30, 2010

vxconfigd - Volume Manager configuration daemon

The Volume Manager configuration daemon, vxconfigd, is responsible for maintaining configurations of disks and disk groups in VERITAS Volume Manager. vxconfigd takes requests from other utilities for configuration changes, communicates those changes to the kernel, and modifies the configuration information stored on disk. vxconfigd is also responsible for initializing the Volume Manager when the system boots.

If the daemon hangs, run vxconfigd -k, which kills the existing daemon and starts a new one. Killing the old vxconfigd and starting a new one should not cause any problems for volume or plex devices that are being used by applications or that contain mounted file systems.
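A minimal sketch of the restart, assuming VxVM is installed (vxdctl mode is a quick way to confirm the daemon came back):

```shell
# Kill the hung daemon and start a fresh one in a single step
vxconfigd -k

# Confirm the new daemon is running in enabled mode
vxdctl mode
```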

Wednesday, September 29, 2010

Introducing the newly allocated storage disk to your system.

New LUNs have been assigned to your host and you want to use them. There is a series of steps involved: first find the allocated disks, then make them available to Veritas, and finally put them to good use.

The disks from the SAN are attached to a particular controller. To find the disks, run the cfgadm command.

List the state and condition of attachment points 
# cfgadm -al

Now configure the fc-fabric controller
# cfgadm -c configure c1 

Recreate the device tree, which also cleans up stale device entries
# devfsadm -C

Now the devices should be available under /dev/rdsk
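From here, a hedged sketch of handing the new disk over to VxVM; the device name c1t0d1 and disk group MyDG are placeholders:

```shell
# Make VxVM rescan for the new devices
vxdctl enable

# The new LUN should now appear, typically as 'online invalid'
vxdisk list

# Initialize the disk for VxVM use and add it to an existing disk group
vxdisksetup -i c1t0d1
vxdg -g MyDG adddisk MyDG01=c1t0d1
```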

Tuesday, September 28, 2010

Beware of making changes when someone is monitoring you!!!

Changes to a system are a daily activity in a sysadmin's life. They cannot be avoided, and there are a lot of tools to help a sysadmin make a change easily and without mistakes. But there is still a need to be careful and aware of the tools which monitor and help you. These tools might be there to help, but if you ignore them, they can turn against you.

Always check if there is any monitoring process running before making any changes to a system. Particularly if there is an application like VCS which monitors and also takes corrective actions when something abnormal is noticed, be extra careful.

Consequences of overlooking a tool like VCS:

VCS normally defines dependencies between applications.
For example, an application may depend on a file system, which in turn depends on a disk group. Likewise, an IP can depend on a file system, and a whole application may in turn depend on that IP.

Making any changes to any of these components without the knowledge of VCS may lead to some bad things !!!

Let me explain a little about VCS first so that the scenario below is clearer.

Veritas Cluster Server is a high availability solution from Symantec. It monitors resources (file systems, disk groups, applications, IPs, HORC, etc.) and can perform failovers in case of a failure on one system, thus keeping the application available. It is one of the best ways to minimize application downtime.

In practice, resources are configured in VCS and monitoring is enabled. Dependencies are also specified so that we can ensure one particular resource cannot exist without its necessary resources. VCS has an agent for each resource type which monitors the status of the associated resources and can take appropriate actions, like online/offline, per the rules specified.

The Scenario:
There is a filesystem /opt/MyApps/Billlogs on a machine named Server1. The filesystem is configured in VCS with some dependencies. ip-MyApps2, the IP resource of MyApps2, is dependent on this filesystem.


 fsSubApp1   SubApp2
      \         /
       \       /
      ip-MyApps2                      (parent)
           |
           |
  /opt/MyApps/Billlogs                (child)
           |
           |
      DG-MyApps


In the above configuration, the file system /opt/MyApps/Billlogs is necessary for all the resources that depend on it: ip-MyApps2, fsSubApp1 and SubApp2. If /opt/MyApps/Billlogs is unmounted, it causes a cascading effect, pulling down the resources that depend on it.
VCS continually monitors every type of resource. If a particular resource is taken offline, the dependent resources are also dealt with appropriately. So if VCS is enabled and running, and the resources are monitored by it, off-lining or on-lining a resource without the knowledge of VCS can have unforeseen impacts. For example, if the file system /opt/MyApps/Billlogs is unmounted manually at the system level, VCS thinks something has gone wrong and will try to take the dependent resources offline.
And if a dependency was wrongly specified during configuration, i.e., if ip-MyApps2 does not in fact depend on /opt/MyApps/Billlogs but the dependency was declared anyway, this leads to downtime of ip-MyApps2 that nobody expected. This mistake is the combination of an improperly configured dependency and an off-lining activity done without consulting VCS.

So how to do it?
The safe way to remove a resource without affecting the other running dependent applications is to first unlink it, so that no dependency remains between the resources, and then safely take offline the resource that is no longer required.

hagrp -unlink parent_group child_group          (for dependencies between service groups)
hares -unlink parent_resource child_resource    (for dependencies between resources)
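As a hedged illustration, with hypothetical resource names based on the scenario above (fs-Billlogs stands in for the file system resource):

```shell
# Remove the resource-level dependency first
hares -unlink ip-MyApps2 fs-Billlogs

# Now the file system resource can be taken offline without
# pulling down ip-MyApps2 and its parents
hares -offline fs-Billlogs -sys Server1

# Confirm the state
hares -display fs-Billlogs
```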

Friday, September 24, 2010

VxVM - Plex State Change Cycle

Plex state changes are part of normal operation and do not necessarily indicate abnormalities that must be corrected.

All CLEAN (DISABLED) plexes are made ACTIVE (ENABLED) when the system starts or when the volume is started.

EMPTY=>CLEAN=>ACTIVE=>OFFLINE/IOFAIL

OFFLINE=>STALE=>ACTIVE
IOFAIL=>ACTIVE
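A hedged sketch of nudging a plex through these states with vxmend and vxvol; the disk group MyDG, volume myvol and plex myvol-01 are placeholders:

```shell
# Mark a plex CLEAN when you know its data is good
vxmend -g MyDG fix clean myvol-01

# Take a plex OFFLINE for maintenance, then bring it back (it returns as STALE)
vxmend -g MyDG off myvol-01
vxmend -g MyDG on myvol-01

# Starting the volume moves CLEAN plexes to ACTIVE (STALE plexes are resynced)
vxvol -g MyDG start myvol
```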



Veritas Volume Manager - Plexes

In Veritas Volume Manager, disk space is allocated as subdisks, plexes and eventually volumes. Contiguous disk blocks are grouped into subdisks, each of which is a portion of a VM disk.

A plex is a group of subdisks. A plex can be organized as stripes, mirrors or RAID layouts. Plexes are used to form volumes.

The 'vxassist' command automatically creates plexes while creating volumes. A plex can also be created separately with the 'vxmake' command and attached to a volume later.

The 'vxprint' command is used to display plex information (vxprint -g <diskgroup> -l <plex>).

Plex States:

VxVM maintains the state of each plex automatically. There are many states associated with a plex which help identify the consistency of its data. These states are very important for the recovery of a volume after a system failure.

ACTIVE State: 
This state shows the plex is in use and I/O operations are happening on it.

CLEAN State:
If a plex has consistent data, this state is set.

EMPTY State:
This is set when a new volume is created and the plex is not yet initialized.

OFFLINE State:
The plex has been deliberately taken offline (for example with 'vxmend off') and is detached from normal volume operation.

Plex Condition Flags:

NODEVICE:
The physical disk behind one of the plex's subdisks is not available. Recovery has to be done before the plex can be used again.

RECOVER:
The physical disk associated with the plex has been reattached but is not in sync with the volume; recovery is needed.

REMOVED:
A subdisk associated with the plex has been lost. A complete recovery of the subdisk is needed.

Plex Kernel State: This indicates whether the plex is accessible to the volume driver. It is maintained internally, and state changes are reflected automatically.

DETACHED:
The plex is in a maintenance state. No write access is allowed to the plex.

DISABLED:
Plex is not accessible.

ENABLED:
Plex is online and read/write access is accepted.


Wednesday, September 22, 2010

Large File Support

Many operating systems were designed with restricted file size support when they were initially developed. As disk capacities and processing power increased, file sizes grew, resulting in files over 2GB and 4GB in size. So the operating systems which had not initially taken this growth into consideration had to provide separate facilities for handling large files.

Large file support can be enabled for a file system when creating it or afterwards. When mounting a file system, there is an option to request large file support, which checks whether the underlying file system has that support enabled.

There are options to switch between largefiles and nolargefiles. But if a file system has largefile support and already contains a large file, converting it to nolargefiles will cause the mount to fail.

fsadm allows you to set largefiles support. An example of enabling the largefiles option using fsadm on HP-UX:

root@Server1:/hroot# fsadm -F vxfs -o largefiles /base/files
root@Server1:/hroot# umount /base/files
root@Server1:/hroot# mount /base/files
root@Server1:/hroot# mount | grep /base/files
/base/files on /dev/vg_base/files ioerror=mwdisable,largefiles,delaylog,dev=402f0009 on Thu Jul 15 16:32:19 2010


Saturday, September 18, 2010

sed - an introduction

A situation arises in which the admin is supposed to make changes to a particular file on a large number of servers, say about 500. Opening and editing the file manually takes ages, and it doesn't make any sense, particularly when there are tools available for exactly such tasks.

One such powerful tool is 'sed'. sed stands for Stream Editor, and it ships with almost all unix flavours. It requires very minimal resources to run. But it is rarely used as an everyday editor because it is non-interactive.

sed reads its input from standard input one line at a time.
sed applies its editing commands to the input stream.
sed sends the output to standard output, which can be redirected.

Let's see an example:

Server1:/home/Aaron# cat file_1
line 1 This is a test file
line 2 We will use this file to test sed
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10

Server1:/home/Aaron# sed -e 's/line/LINE/g' file_1
LINE 1 This is a test file
LINE 2 We will use this file to test sed
LINE 3
LINE 4
LINE 5
LINE 6
LINE 7
LINE 8
LINE 9
LINE 10
Server1:/home/Aaron#

This is what has happened:
sed reads the standard input into the pattern space, performs a sequence of editing commands (here, substitution of 'line' with 'LINE') on the pattern space, then writes the pattern space to STDOUT.
Note: The original file is unharmed.

There are a lot of such commands associated with sed; it is one of the most powerful utilities. Let's have a look at the most frequently used commands.
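As a quick taste, here is a hedged demo of the three most common commands, s (substitute), d (delete) and p (print), on a throwaway file:

```shell
# Build a small sample file
printf 'line 1\nline 2\nline 3\n' > /tmp/sed_demo

# s: substitute 'line' with 'LINE' on every line
sed 's/line/LINE/' /tmp/sed_demo

# d: delete line 2
sed '2d' /tmp/sed_demo

# p with -n: suppress automatic output and print only line 1
sed -n '1p' /tmp/sed_demo
```

As with the substitution example above, the original file is never modified; sed only transforms the stream it writes to standard output.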

Let's see more about the commands later.

Friday, September 17, 2010

SAN Storage

SAN (storage area network) storage is a type of computer data storage system designed specifically for use with large networks. It is very expensive but also very reliable, scalable and flexible. It is network based: the storage box is connected to the servers through switches. SANs support RAID technologies, which provide various ways to optimize data. A SAN provides high data availability, fast access, protection from disk failures and fast recovery. These advantages come at a high cost.

SANs are most commonly implemented using a technology called Fibre Channel, a high-performance data communications technology that supports very fast data rates (over 2 Gbps).

A SAN presents shared pools of storage devices to multiple servers. Each server can access the storage as if it were directly attached to that server. SANs make it possible to move data between various storage devices, share data between multiple servers, and backup and restore data rapidly and efficiently.

A simple illustration of how a SAN and a server interact (www.vmware.com):


 


How do they communicate?
The host sends an access request to the SAN; the HBA and switches act as the medium through which the request is sent.
The request reaches the storage processors, which form the front-end interface of the SAN and communicate with the disk arrays, eventually reaching the LUNs.

Storage devices (disk arrays) use RAID to group the disks and provide various functionality. The smallest unit of storage presented is the LUN.

When provisioning storage, the administrator uses management software to create LUNs. They can create, for example, more than one LUN from one physical drive, which would then appear as two or more discrete drives to the user. Or they may create a number of LUNs that span several separate disks that form a RAID array; but, again, these will appear as discrete drives to users. 

A given host might be able to access a LUN on a storage array through more than one path. Having more than one path from a host to a LUN is called multipathing.
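A hedged example of inspecting those paths: mpathadm is the Solaris 10 native multipathing (MPxIO) tool, and vxdmpadm is the VxVM DMP equivalent (the device name EMC0_0 is a placeholder):

```shell
# List logical units and their path counts (Solaris 10 MPxIO)
mpathadm list lu

# Show the subpaths behind one VxVM DMP device
vxdmpadm getsubpaths dmpnodename=EMC0_0
```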

LUNs can be shared between several servers. While implementing failovers, LUNs can be moved from one host to another. 

Zoning:

This is a way of providing access control within a SAN. In a physical SAN, LUNs may be shared across many hosts. By zoning, it is possible to logically group hosts and storage in a SAN. It provides a way of authorization. Only the authorized hosts can see the associated devices. Zoning lets you isolate a single server to a group of storage devices or a single storage device, or associate a grouping of multiple servers with one or more storage devices, as might be needed in a server cluster deployment.

LUN Masking:

This is used to make a LUN visible to some hosts and invisible to others. It protects a LUN from servers that might otherwise corrupt it.

Storage management is itself a very big area in IT infrastructure management. But for a system admin, storage is an indispensable area that needs a really good understanding of how storage works with servers. Hopefully the above description of SAN has helped a little.


Tuesday, September 14, 2010

World Wide Name (WWN)

A World Wide Name, or WWN, is a 64-bit address used to uniquely identify each element in a Fibre Channel network. It is similar to a MAC address for network interfaces.

You can use the "fcinfo hba-port" command in Solaris 10 to get WWN information.

Here you can see the HBA port WWN (WWpN) and the node WWN (WWnN) of the ports on the installed QLogic HBA card. This is also useful for finding the model number, firmware version, FCode version, supported and current speeds, and the port status of the HBA card/port.

Server1:/# fcinfo hba-port

HBA Port WWN: 210100e08bb467d8
        OS Device Name: /dev/cfg/c5
        Manufacturer: QLogic Corp.
        Model: 375-3363-xx
        Firmware Version: 03.03.28
        FCode/BIOS Version:  fcode: 1.13;
        Serial Number: not available
        Driver Name: qlc
        Driver Version: 20090519-2.31
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb
        Current Speed: 2Gb
        Node WWN: 200100e08bb467d8

Sunday, September 12, 2010

Sun SPARC Enterprise Servers

As a System Administrator, knowing the hardware which runs the operating system is as important as knowing the operating system itself.
SPARC is the architecture designed by Sun for its range of servers. Sun has a variety of hardware specifications for its mid-range and high-end servers.


T-series Model: T1000, T2000, T5120, T5140, T5220, T5240, T5440
M-Series Model: M3000, M4000, M5000, M8000, M9000


Of these, the latest model is the M9000:


The Sun SPARC Enterprise M9000 server is one of the leading hardware platforms, known for its high performance and scalability. It supports up to 4TB of memory, 64 processors and 256 cores. It's designed for demanding virtualization, consolidation, and multi-hosting deployments that require mission-critical RAS features.


These M-class machines support Dynamic Domains and Dynamic Reconfiguration, which allow a single machine to be divided into multiple electronically isolated partitions (up to 24 such domains).


For more specifications please refer
http://www.oracle.com/us/products/servers-storage/servers/index.html

Saturday, September 11, 2010

ssh login for root user

In an environment where thousands of servers must be managed, ssh plays a very important role. The ssh utility enables a system administrator to log in to every server from a common server as root. This eliminates the need to type in root passwords for each and every server when logging in.
There are, of course, restrictions enforced on ssh for the root user that have to be tweaked.


The file /etc/ssh/sshd_config has the parameter for enabling or disabling ssh for root.


# Are root logins permitted using sshd.
# Note that sshd uses pam_authenticate(3PAM) so the root (or any other) user
# maybe denied access by a PAM module regardless of this setting.
# Valid options are yes, without-password, no.
PermitRootLogin yes


Restart the sshd after changing any configuration.
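A hedged example of the restart: on Solaris 10 sshd runs under SMF, while older releases need a HUP signal (the pid file path may vary by system):

```shell
# Solaris 10 and later (SMF)
svcadm restart svc:/network/ssh:default

# Older Solaris releases: signal the running daemon to reread its config
kill -HUP "$(cat /var/run/sshd.pid)"
```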

ssh login

ssh is a secure way to log in between 2 networked hosts. It is primarily used on unix-based systems and is designed to replace telnet, rsh, rlogin etc. It is an encrypted alternative to other shell-based logins, and is based on a public-key/private-key authentication model.

How to enable ssh between 2 hosts and set up passwordless login:

1. First login to server1 as user aaron

login as: aaron
aaron@server1's password:
Last login: Fri Sep 10 15:19:14 2010 from 10.120.129.49
Sun Microsystems Inc.   SunOS 5.8       Generic Patch   December 2002
Welcome !! Aaron Schweitzer !!!
...............................
YOU ARE NOW LOGGED IN - Sat Sep 11 13:01:20 MEST 2010
server1:~ $
server1:~ $
server1:~ $cd .ssh
server1:~/.ssh $ls
known_hosts


2. Generate the ssh key with the ssh-keygen utility



server1:~/.ssh $ssh-keygen -t rsa -N ""
Generating public/private rsa key pair.

Enter file in which to save the key (/home/aaron/.ssh/id_rsa):
Your identification has been saved in /home/aaron/.ssh/id_rsa.
Your public key has been saved in /home/aaron/.ssh/id_rsa.pub.
The key fingerprint is:
53:71:83:1c:4a:d7:95:72:a8:4f:19:74:62:70:42:9b aaron@server1
server1:~/.ssh $ls
id_rsa       id_rsa.pub   known_hosts     //id_rsa.pub has public key that has to be shared with server2
server1:~/.ssh $cat id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAtkZQVO6qNTfj+LrD03GdoDe2A+H9vkjW0ojK+BRcRYt9DYDvB1PD7CwFlmB+qHO4u1URLNzmoW7oL6XYsJcO0JiEE1mIq14LXS/Elap/es2RoN+qwezcwwZVzXz6C1gt1ds01aiBKXatZY5+ndIC4o+HHLCaWRqZ+JUttha0Iak= aaron@server1
//This key has to be copied to a file named authorized_keys in .ssh directory in the user's home directory

3. Log in to server2 as the same user and generate the ssh key in the same way as above

server1:~/.ssh $ssh server2
Password:

Last login: Sat Sep 11 13:08:23 2010 from server1.mobile.

Sun Microsystems Inc.   SunOS 5.10      Generic January 2005
Welcome !! Aaron Schweitzer !!!
...............................
YOU ARE NOW LOGGED IN - Saturday, September 11, 2010  1:08:23 PM MEST
server2:~ $
server2:~ $cd .ssh
server2:~/.ssh $ssh-keygen -t rsa -N ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/aaron/.ssh/id_rsa): y
Your identification has been saved in y.
Your public key has been saved in y.pub.
The key fingerprint is:
16:07:0b:e7:49:d0:49:89:fa:0e:e6:2e:4f:9e:b5:3d aaron@server2


server2:~/.ssh $cat y.pub

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAuhSTr/UkTwOpsbjSzwHq89Zfd2fW/o45X/VH9QxFmWKMQpX3DAEQ0KeY1f+aM8NYNA675lNOtehXxahELSPy6DqRUbL5a9B2lIgHHhaG9dTxKRtwz4qxZYW6S7fT9HXPueHKQfyGjP0lqp2twFC7JOCH9wnOreDj9jPPjMI0hB8= aaron@server2

4. Copy the public keys id_rsa.pub (server1) and y.pub (server2) and paste each into a file named authorized_keys: server1's public key goes into server2's authorized_keys file and vice versa.
(Note: below, both keys appear in the same file because the home directory is shared from a fileserver.)






server2:~/.ssh $vi authorized_keys
"authorized_keys" [New file]
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAtkZQVO6qNTfj+LrD03GdoDe2A+H9vkjW0ojK+BRcRYt9DYDvB1PD7CwFlmB+qHO4
u1URLNzmoW7oL6XYsJcO0JiEE1mIq14LXS/Elap/es2RoN+qwezcwwZVzXz6C1gt1ds01aiBKXatZY5+ndIC4o+HHLCaWRqZ+JUt
tha0Iak= aaron@server1
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAuhSTr/UkTwOpsbjSzwHq89Zfd2fW/o45X/VH9QxFmWKMQpX3DAEQ0KeY1f+aM8NY
NA675lNOtehXxahELSPy6DqRUbL5a9B2lIgHHhaG9dTxKRtwz4qxZYW6S7fT9HXPueHKQfyGjP0lqp2twFC7JOCH9wnOreDj9jPP
jMI0hB8= aaron@server2

5. Now the setup is complete. User aaron can use ssh to log in from server1 to server2 and from server2 to server1 without a password.

server2:~/.ssh $ssh server1
The authenticity of host 'server1 (10.1.64.174)' can't be established.
RSA key fingerprint is c9:ae:b4:be:b7:f5:56:b1:e8:ef:18:31:97:d6:8c:05.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'server1,10.1.64.174' (RSA) to the list of known hosts.
Last login: Sat Sep 11 13:01:20 2010 from edmj625.nt.mobi
Sun Microsystems Inc.   SunOS 5.8       Generic Patch   December 2002
Welcome !! Aaron Schweitzer !!!
...............................
YOU ARE NOW LOGGED IN - Sat Sep 11 13:10:35 MEST 2010
server1:~ $ssh server2
Last login: Sat Sep 11 13:08:23 2010 from server1.mobile.
Sun Microsystems Inc.   SunOS 5.10      Generic January 2005
Welcome !! Aaron Schweitzer !!!
...............................
YOU ARE NOW LOGGED IN - Saturday, September 11, 2010  1:10:48 PM MEST
server2:~ $
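Instead of pasting keys by hand, the copy can be scripted. A hedged one-liner, assuming OpenSSH on both ends (it prompts for the password once; key-based logins work afterwards):

```shell
cat ~/.ssh/id_rsa.pub | ssh aaron@server2 \
  'mkdir -p ~/.ssh && chmod 700 ~/.ssh && \
   cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'
```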




Friday, September 10, 2010

LOFS Mounts

A loopback file system is a feature that allows the creation of a virtual file system that acts as an alternate path to an already mounted file system, like a hard link to a mount point.

This feature is very useful when implementing zones.

Consider a situation in which you have a global zone where the oracle client is installed. The machine has, say, 5 zones, and the requirement is to have the oracle client installed in each zone.
Using the lofs concept, we can make the oracle client available in the global zone accessible from the zones as if it were installed locally.


The zone available in this machine is:

server1:/root# zoneadm list -icv
  76 zone1    running    /zones/zone1                native   shared



The file system /opt/oracle is in global machine server1 with oracle client installed:

server1:/opt/oracle# df -k .
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/vx/dsk/DG-LOCAL/VOL-LOCAL-opt-oracle
                     5242880 3186470 1927985    63%    /opt/oracle
server1:/opt/oracle# ls
admin       lost+found  product


Lets check the zone configuration for the above zone:

server1:/opt/oracle# zonecfg -z zone1 info
zonename: zone1
zonepath: /zones/zone1
brand: native
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: shared
[cpu-shares: 20]
fs:
        dir: /etc/globalname
        special: /etc/nodename
        raw not specified
        type: lofs
        options: [ro]
net:
        address: 10.120.198.22
        physical: bge0
        defrouter: 10.120.198.3
net:
        address: 10.3.90.230
        physical: nxge1
        defrouter not specified
rctl:
        name: zone.cpu-shares
        value: (priv=privileged,limit=20,action=none)


Configure the lofs fs on the zone:


server1:/opt/oracle# zonecfg -z zone1
zonecfg:zone1> add fs
zonecfg:zone1:fs> set dir=/opt/oracle
zonecfg:zone1:fs> set special=/opt/oracle
zonecfg:zone1:fs> set type=lofs
zonecfg:zone1:fs> end
zonecfg:zone1> verify
zonecfg:zone1> commit
zonecfg:zone1> info
zonename: zone1
zonepath: /zones/zone1
brand: native
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: shared
[cpu-shares: 20]
fs:
        dir: /etc/globalname
        special: /etc/nodename
        raw not specified
        type: lofs
        options: [ro]
fs:
        dir: /opt/oracle
        special: /opt/oracle
        raw not specified
        type: lofs
        options: []
net:
        address: 10.120.198.22
        physical: bge0
        defrouter: 10.120.198.3
net:
        address: 10.3.90.230
        physical: nxge1
        defrouter not specified
rctl:
        name: zone.cpu-shares
        value: (priv=privileged,limit=20,action=none)
zonecfg:zone1> exit


Mount the lofs file system at the correct path. The step below works even without configuring it in zonecfg, but we configure it in zonecfg to make the fs permanent, so that when the zone starts up this mount happens automatically (like vfstab).

server1:/# mount -F lofs /opt/oracle /zones/zone1/root/opt/oracle
mount: Mount point /zones/zone1/root/opt/oracle does not exist.
server1:/#
server1:/# mkdir /zones/zone1/root/opt/oracle
server1:/#
server1:/# mount -F lofs /opt/oracle /zones/zone1/root/opt/oracle
server1:/#
server1:/# ls -ld /opt/oracle
drwxr-xr-x   5 oracle   dba           96 Aug 22  2008 /opt/oracle
server1:/#
server1:/# ls -ld /zones/zone1/root/opt/oracle
drwxr-xr-x   5 oracle   dba           96 Aug 22  2008 /zones/zone1/root/opt/oracle


Now the /opt/oracle file system is available in the zone. To check, zlogin to zone1 and try accessing /opt/oracle.

The above example showed how to configure lofs in zone configuration and mount the file system as lofs.

This way we have made the oracle client installed on the global machine accessible in the zone, thereby saving the additional disk space that would be needed if the oracle client were installed separately inside the zone.

Tuesday, September 7, 2010

VxVM - Recovering Volumes after a Disk Failure !!!


Situation: The host has lost its connection to the SAN box, which makes disks unavailable to the dg, which in turn affects the mounts.

The goal is to recover and start a Veritas Volume Manager logical volume where the volume is DISABLED ACTIVE and the state of the plex is DISABLED NODEVICE. When a system encounters a problem with a volume or a plex, or if Veritas Volume Manager (VxVM) has any reason to believe that the data is not synchronized, VxVM changes the kernel state (KSTATE) and state (STATE) of the volume and its plexes accordingly.

The plex state can be stale, empty, nodevice, etc. A particular plex state does not necessarily mean that the data is good or bad. 
The plex state is representative of VxVM's perception of the data in a plex. 

vxprint displays information from records in VxVM disk group configurations, including the KSTATE and STATE of a volume and plex. When viewing the configuration records of a VxVM disk group using the vxprint utility and the KSTATE and STATE fields display DISABLED ACTIVE for the volume and DISABLED RECOVER for the plex, recovery steps need to be followed to bring the volume back to an ENABLED ACTIVE state so it can be mounted and make the file system accessible again.

Below are the steps to follow:

1. Check the dg; if the status is disabled, deport and import the dg.

(server1:/)# vxdg list
NAME         STATE           ID
MyDG-app    enabled         1232625005.170.server1

2. Check the dg details using the vxprint utility and check the volumes.

(server1:/)# vxprint -g MyDG-app -v

TY NAME         ASSOC        KSTATE   LENGTH   PLOFFS   STATE    TUTIL0  PUTIL0
v  usrapp       fsgen        DISABLED 2097152  -        ACTIVE   -       -
v  usrappPS4    fsgen        DISABLED 25165824 -        ACTIVE   -       -
v  usrappSMD    fsgen        DISABLED 3145728  -        ACTIVE   -       -
v  usrappput    fsgen        DISABLED 10485760 -        ACTIVE   -       -
v  MyDG-swap   fsgen        ENABLED  62914560 -        ACTIVE   -       -

3. Try starting the volumes.

(server1:/)# vxvol -g MyDG-app startall
VxVM vxvol ERROR V-5-1-1201 Volume usrapp has no associated data plexes
VxVM vxvol ERROR V-5-1-1201 Volume usrappPS4 has no associated data plexes
VxVM vxvol ERROR V-5-1-1201 Volume usrappSMD has no associated data plexes
VxVM vxvol ERROR V-5-1-1201 Volume usrappput has no associated data plexes

4. Check the dg details using vxprint -htg.

(server1:/)#vxprint -htg MyDG-app

DG NAME         NCONFIG      NLOG     MINORS   GROUP-ID
ST NAME         STATE        DM_CNT   SPARE_CNT         APPVOL_CNT
DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE
RV NAME         RLINK_CNT    KSTATE   STATE    PRIMARY  DATAVOLS  SRL
RL NAME         RVG          KSTATE   STATE    REM_HOST REM_DG    REM_RLNK
CO NAME         CACHEVOL     KSTATE   STATE
VT NAME         NVOLUME      KSTATE   STATE
V  NAME         RVG/VSET/CO  KSTATE   STATE    LENGTH   READPOL   PREFPLEX UTYPE
PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID MODE
SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
SV NAME         PLEX         VOLNAME  NVOLLAYR LENGTH   [COL/]OFF AM/NM    MODE
SC NAME         PLEX         CACHE    DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
DC NAME         PARENTVOL    LOGVOL
SP NAME         SNAPVOL      DCO

dg MyDG-app    default      default  76000    1226671526.111.serverold

dm EMC0_0       -            -        -        -        NODEVICE
dm EMC0_1       -            -        -        -        NODEVICE
dm EMC0_2       -            -        -        -        NODEVICE
dm EMC0_4       -            -        -        -        NODEVICE
dm EMC0_5       -            -        -        -        NODEVICE
dm EMC0_16      -            -        -        -        NODEVICE
dm EMC0_17      -            -        -        -        NODEVICE
dm EMC0_18      -            -        -        -        NODEVICE
dm EMC0_19      -            -        -        -        NODEVICE
dm EMC1_0       EMC1_0       auto     3583     70707840 -

v  appPS4       -            DISABLED ACTIVE   12582912 SELECT    -        fsgen
pl appPS4-01    appPS4       DISABLED NODEVICE 12582912 CONCAT    -        RW
sd EMC0_4-01    appPS4-01    EMC0_4   0        12582912 0         -        NDEV

v  appPS4archreorg -         DISABLED ACTIVE   10485760 SELECT    -        fsgen
pl appPS4archreorg-01 appPS4archreorg DISABLED NODEVICE 10485760 CONCAT -  RW
sd EMC0_16-01   appPS4archreorg-01 EMC0_16 0   10485760 0         -        NDEV

v  appPS4mlogA  -            DISABLED ACTIVE   1048576  SELECT    -        fsgen
pl appPS4mlogA-01 appPS4mlogA DISABLED NODEVICE 1048576 CONCAT    -        RW
sd EMC0_18-01   appPS4mlogA-01 EMC0_18 0       1048576  0         -        NDEV

v  appPS4mlogB  -            DISABLED ACTIVE   1048576  SELECT    -        fsgen
pl appPS4mlogB-01 appPS4mlogB DISABLED NODEVICE 1048576 CONCAT    -        RW
sd EMC0_19-01   appPS4mlogB-01 EMC0_19 0       1048576  0         -        NDEV

v  appPS4ologA  -            DISABLED ACTIVE   1048576  SELECT    -        fsgen
pl appPS4ologA-01 appPS4ologA DISABLED NODEVICE 1048576 CONCAT    -        RW
sd EMC0_1-02    appPS4ologA-01 EMC0_1 1048576  1048576  0         -        NDEV

v  appPS4ologB  -            DISABLED ACTIVE   1048576  SELECT    -        fsgen
pl appPS4ologB-01 appPS4ologB DISABLED NODEVICE 1048576 CONCAT    -        RW
sd EMC0_18-02   appPS4ologB-01 EMC0_18 1048576 1048576  0         -        NDEV

v  appPS4appdata1 -          DISABLED ACTIVE   241172480 SELECT   -        fsgen
pl appPS4appdata1-01 appPS4appdata1 DISABLED NODEVICE 241172480 CONCAT -   RW
sd EMC0_0-02    appPS4appdata1-01 EMC0_0 4194304 66513536 0       -        NDEV
sd EMC0_1-03    appPS4appdata1-01 EMC0_1 2097152 68610688 66513536 -       NDEV
sd EMC0_16-02   appPS4appdata1-01 EMC0_16 10485760 37437568 135124224 -    NDEV
sd EMC0_18-03   appPS4appdata1-01 EMC0_18 2097152 68610688 172561792 -     NDEV

v  appPS410264  -            DISABLED ACTIVE   16777216 SELECT    -        fsgen
pl appPS410264-01 appPS410264 DISABLED NODEVICE 16777216 CONCAT   -        RW
sd EMC0_5-01    appPS410264-01 EMC0_5 0        16777216 0         -        NDEV

v  appcle       -            DISABLED ACTIVE   4194304  SELECT    -        fsgen
pl appcle-01    appcle       DISABLED NODEVICE 4194304  CONCAT    -        RW
sd EMC0_0-01    appcle-01    EMC0_0   0        4194304  0         -        NDEV

v  appclient    -            DISABLED ACTIVE   1048576  SELECT    -        fsgen
pl appclient-01 appclient    DISABLED NODEVICE 1048576  CONCAT    -        RW
sd EMC0_1-01    appclient-01 EMC0_1   0        1048576  0         -        NDEV

v  appstage102  -            DISABLED ACTIVE   20971520 SELECT    -        fsgen
pl appstage102-01 appstage102 DISABLED NODEVICE 20971520 CONCAT   -        RW
sd EMC0_2-01    appstage102-01 EMC0_2 0        20971520 0         -        NDEV

v  apptemp      -            DISABLED ACTIVE   52428800 SELECT    -        fsgen
pl apptemp-01   apptemp      DISABLED NODEVICE 52428800 CONCAT    -        RW
sd EMC0_17-01   apptemp-01   EMC0_17  0        52428800 0         -        NDEV

v  appmntPS4    -            DISABLED ACTIVE   10485760 SELECT    -        fsgen
pl appmntPS4-01 appmntPS4    DISABLED NODEVICE 10485760 CONCAT    -        RW
sd EMC0_5-02    appmntPS4-01 EMC0_5   16777216 10485760 0         -        NDEV

v  apptemp      -            DISABLED ACTIVE   83886080 SELECT    -        fsgen
pl apptemp-01   apptemp      DISABLED NODEVICE 83886080 CONCAT    -        RW
sd EMC0_4-02    apptemp-01   EMC0_4   12582912 58124928 0         -        NDEV
sd EMC0_19-02   apptemp-01   EMC0_19  1048576  25761152 58124928  -        NDEV

v  usrapp       -            DISABLED ACTIVE   2097152  SELECT    -        fsgen
pl usrapp-01    usrapp       DISABLED NODEVICE 2097152  CONCAT    -        RW
sd EMC0_2-02    usrapp-01    EMC0_2   20971520 2097152  0         -        NDEV

v  usrappPS4    -            DISABLED ACTIVE   25165824 SELECT    -        fsgen
pl usrappPS4-01 usrappPS4    DISABLED NODEVICE 25165824 CONCAT    -        RW
sd EMC0_2-03    usrappPS4-01 EMC0_2   23068672 25165824 0         -        NDEV

v  usrappSMD    -            DISABLED ACTIVE   3145728  SELECT    -        fsgen
pl usrappSMD-01 usrappSMD    DISABLED NODEVICE 3145728  CONCAT    -        RW
sd EMC0_19-03   usrappSMD-01 EMC0_19  26809728 3145728  0         -        NDEV

v  usrappput    -            DISABLED ACTIVE   10485760 SELECT    -        fsgen
pl usrappput-01 usrappput    DISABLED NODEVICE 10485760 CONCAT    -        RW
sd EMC0_5-04    usrappput-01 EMC0_5   28311552 10485760 0         -        NDEV

v  MyDG-swap   -            ENABLED  ACTIVE   62914560 SELECT    -        fsgen
pl MyDG-swap-01 MyDG-swap  ENABLED  ACTIVE   62914560 CONCAT    -        RW
sd EMC1_0-01    MyDG-swap-01 EMC1_0  0        62914560 0         EMC1_0   ENA

5. The above output shows several plexes in NODEVICE state, which suggests some 
disks have failed. Check the disk status using the vxdisk command

(server1:/)# vxdisk list | grep MyDG-app
EMC1_0       auto:sliced     EMC1_0       MyDG-app    online
-            -         EMC0_0       MyDG-app    failed was:EMC0_0
-            -         EMC0_1       MyDG-app    failed was:EMC0_1
-            -         EMC0_2       MyDG-app    failed was:EMC0_2
-            -         EMC0_4       MyDG-app    failed was:EMC0_4
-            -         EMC0_5       MyDG-app    failed was:EMC0_5
-            -         EMC0_16      MyDG-app    failed was:EMC0_16
-            -         EMC0_17      MyDG-app    failed was:EMC0_17
-            -         EMC0_18      MyDG-app    failed was:EMC0_18
-            -         EMC0_19      MyDG-app    failed was:EMC0_19
The vxprint command also shows the disk status
(server1:/)# vxprint -htg MyDG-app -d
DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE

dm EMC0_0       -            -        -        -        NODEVICE
dm EMC0_1       -            -        -        -        NODEVICE
dm EMC0_2       -            -        -        -        NODEVICE
dm EMC0_4       -            -        -        -        NODEVICE
dm EMC0_5       -            -        -        -        NODEVICE
dm EMC0_16      -            -        -        -        NODEVICE
dm EMC0_17      -            -        -        -        NODEVICE
dm EMC0_18      -            -        -        -        NODEVICE
dm EMC0_19      -            -        -        -        NODEVICE
dm EMC1_0       EMC1_0       auto     3583     70707840 -
6. The above output shows the disks in a failed state. They can be reattached 
using the vxreattach command

(server1:/)# /etc/vx/bin/vxreattach -c EMC0_0 
MyDG-app EMC0_0
//The -c option only checks and reports which dg the disk is associated with; it does not reattach
(server1:/)# /etc/vx/bin/vxreattach EMC0_1        //attach all disks similarly
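Rather than reattaching disk by disk, the step above can be scripted. A sketch, not verified against a live VxVM setup: pull the disk media names of the failed disks out of the `vxdisk list` output and feed each one to vxreattach.

```shell
# failed_disks expects `vxdisk list` text on stdin; on the failed lines,
# column 5 is the status and column 3 is the disk media name.
failed_disks() {
    awk '$5 == "failed" { print $3 }'
}

# Hypothetical usage on the live system:
# vxdisk list | failed_disks | while read d; do
#     /etc/vx/bin/vxreattach "$d"
# done
```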
7. Now check the status
(server1:/)# vxdisk list | grep MyDG-app
EMC0_0       auto:sliced     EMC0_0       MyDG-app    online
EMC0_1       auto:sliced     EMC0_1       MyDG-app    online
EMC0_2       auto:sliced     EMC0_2       MyDG-app    online
EMC0_4       auto:sliced     EMC0_4       MyDG-app    online
EMC0_5       auto:sliced     EMC0_5       MyDG-app    online
EMC0_16      auto:sliced     EMC0_16      MyDG-app    online
EMC0_17      auto:sliced     EMC0_17      MyDG-app    online
EMC0_18      auto:sliced     EMC0_18      MyDG-app    online
EMC0_19      auto:sliced     EMC0_19      MyDG-app    online
EMC1_0       auto:sliced     EMC1_0       MyDG-app    online

//vxdiskconfig - this command rescans all disks and brings newly attached devices under VxVM control
8. Check the status again; the plexes now show a RECOVER state
(server1:/)# vxprint -htg MyDG-app -v | grep pl
pl appPS4-01    appPS4       DISABLED RECOVER  12582912 CONCAT    -        RW
pl appPS4archreorg-01 appPS4archreorg DISABLED RECOVER 10485760 CONCAT -   RW
pl appPS4mlogA-01 appPS4mlogA DISABLED RECOVER 1048576  CONCAT    -        RW
pl appPS4mlogB-01 appPS4mlogB DISABLED RECOVER 1048576  CONCAT    -        RW
pl appPS4ologA-01 appPS4ologA DISABLED RECOVER 1048576  CONCAT    -        RW
pl appPS4ologB-01 appPS4ologB DISABLED RECOVER 1048576  CONCAT    -        RW
pl appPS4appdata1-01 appPS4appdata1 DISABLED RECOVER 241172480 CONCAT -    RW
pl appPS410264-01 appPS410264 DISABLED RECOVER 16777216 CONCAT    -        RW
pl appcle-01    appcle       DISABLED RECOVER  4194304  CONCAT    -        RW
pl appclient-01 appclient    DISABLED RECOVER  1048576  CONCAT    -        RW
pl appstage102-01 appstage102 DISABLED RECOVER 20971520 CONCAT    -        RW
pl apptemp-01   apptemp      DISABLED RECOVER  52428800 CONCAT    -        RW
pl appmntPS4-01 appmntPS4    DISABLED RECOVER  10485760 CONCAT    -        RW
pl apptemp-01   apptemp      DISABLED RECOVER  83886080 CONCAT    -        RW
pl usrapp-01    usrapp       DISABLED RECOVER  2097152  CONCAT    -        RW
pl usrappPS4-01 usrappPS4    DISABLED RECOVER  25165824 CONCAT    -        RW
pl usrappSMD-01 usrappSMD    DISABLED RECOVER  3145728  CONCAT    -        RW
pl usrappput-01 usrappput    DISABLED CLEAN    10485760 CONCAT    -        RW
pl MyDG-swap-01 MyDG-swap  ENABLED  ACTIVE   62914560 CONCAT    -        RW

9. Now recover the volumes by fixing each plex first to STALE and then to CLEAN

->Capture the plex names (second column) in a file
(server1:/)# vxprint -htg MyDG-app -v | grep pl | awk '{print $2}' > /var/tmp/vxmendlist

->Loop each plex and fix the plex state
(server1:/)# for i in `cat /var/tmp/vxmendlist`
> do
> vxmend -g MyDG-app fix stale $i
> vxmend -g MyDG-app fix clean $i
> done

10. The plexes should now be in DISABLED CLEAN state. Use vxvol to start all the volumes in the dg

(server1:/)# vxvol -g MyDG-app startall


==>> We have successfully recovered the volumes :) 




Some docs on VxVM:
http://sfdoccentral.symantec.com/sf/5.0/solaris/pdf/vxvm_admin.pdf

Sunday, September 5, 2010

SUN Explorer

All machines are prone to failure at some point. Failures may be caused by any number of issues.

It could be a hardware failure.
It could be a software error.

Though high-end machines are designed for hardware fault tolerance, there is no 100% assurance. Failures can still happen, though they are very rare in most high-end production environments.

To identify the cause of a failure, diagnostic data about the system is needed. One way Sun generates such data is through Sun Explorer.

Sun Explorer gathers information and creates a detailed snapshot of a system's configuration. Its output is designed to enable Sun support engineers to perform effective assessments of a system, provided a valid Sun support contract exists. The Explorer tool simply collects diagnostic information, including logs and system configuration, for analysis by Sun.

The tool is delivered in the packages SUNWexplo and SUNWexpl.

Explorer files can be found under /opt/SUNWexplo

-> Run the explorer

      /opt/SUNWexplo/bin/explorer

-> The resulting Explorer output is normally stored in

     /opt/SUNWexplo/output

This Explorer output then needs to be sent to Sun for diagnosis. There are other ways as well to collect data for diagnosis.

Saturday, September 4, 2010

du vs df - The Open File Descriptor Problem

You are likely to encounter a situation in which df reports a filesystem as 100% full, whereas the actual usage (reported by du) is much lower.

eg)
(server1:/)# df -k /usrlocaltrpt
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/vx/dsk/dg1/usrlocaltrpt
                     2031711 1970765       0   100%    /usrlocaltrpt

(server1:/)# du -sh /usrlocaltrpt
 254M   /usrlocaltrpt

Open file descriptors are the main cause of such conflicting information.
For example, if a file called /usrlocaltrpt/temp.log is held open by an application or a user and the same file is deleted, df and du report different output: du no longer sees the file, but the kernel does not release the blocks until the last descriptor on the file is closed.
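A quick way to see this effect on any scratch directory (the paths here are throwaway, not the ones from the example above):

```shell
tmpdir=$(mktemp -d)
echo "some data" > "$tmpdir/temp.log"
exec 3< "$tmpdir/temp.log"   # hold the file open on descriptor 3
rm "$tmpdir/temp.log"        # unlink: du and ls no longer see the file
ls -A "$tmpdir"              # prints nothing, yet the blocks are not freed
cat <&3                      # the open descriptor can still read "some data"
exec 3<&-                    # closing the last descriptor releases the space
rmdir "$tmpdir"
```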

This situation can be solved in 2 ways,
1. Unmount the fs and remount it
2. Find the process/application responsible and restart or kill it.

The first way is simple but not feasible for some critical filesystems.

The second way needs some work to identify the open files and the processes associated with them.
 ->There is a 3rd-party tool called 'lsof' which lists the open files
 ->Scan the /proc directory for the open files and search manually
 ->Use fuser -cu and find the processes

eg) to find the processes from the /proc directory

(server1:/proc)# for i in *; do pfiles $i | grep -i trpt; done
      /usr/local/trpt/Encore.log
      /usr/local/trpt/Encore.log
      /usr/local/trpt/Encore.log
      /usr/local/trpt/modem.log
      /usr/local/trpt/params.xml

or

(server1:/usrlocaltrpt)# fuser -cu /usrlocaltrpt
/usrlocaltrpt/:     1519c(root)   26639o(root)   29155c(root)   21540o(root)    6184o(ipde)    1293o(noaccess)    1212o(root)    1091c(root)    1039o(root)     986co(root)     943o(root)     939o(root)     926o(root)     788co(smmsp)     783o(root)     745o(root)     743o(root)     742o(root)     731o(root)     683c(daemon)     657co(root)     185o(root)       7o(root)
(server1:/usrlocaltrpt)#

Use a loop over the above PIDs to find the associated processes. Look for defunct processes or a process holding a file open.
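The loop can be as simple as the sketch below (the PIDs are hypothetical, picked from output like the above):

```shell
# For each PID reported by fuser, show the owning command if it is still alive
for pid in 1519 26639 29155; do
    ps -o pid= -o args= -p "$pid" 2>/dev/null
done
```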

Friday, September 3, 2010

Cron jobs

Cron jobs are the way of implementing scheduled jobs in Solaris (as with any Unix-based system).

Below is some useful information on cron jobs. I am not covering the format of defining cron jobs, just a few commands and files related to cron jobs which are useful for troubleshooting.

crontab -l

 -> Displays the cron jobs defined for the user who invoked it.

crontab -e
 ->Invokes the editor to edit the cron jobs for the invoking user.
 ->Users might have to export the EDITOR variable for the editor to open properly
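For example, in sh/ksh:

```shell
# Point crontab -e at a specific editor (sh/ksh syntax)
EDITOR=vi
export EDITOR
# crontab -e   # now opens the invoking user's crontab in vi
```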

If crontab -l is not working, check which users have cron jobs defined. Root can see this from the files in the below location
->/var/spool/cron/crontabs

Cron job logs are in /var/adm/cron -> Accessible to root, with entries marking the start and end of each job. Very useful for troubleshooting.

Check the cron.allow and cron.deny files to see if the user has enough privilege to run cron jobs.
/etc/cron.d/cron.allow       
/etc/cron.d/cron.deny       

By default, only root is allowed to use cron if the cron.allow file is not present.

Check if the cron daemon is running or not.
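One way to check is with pgrep; the SMF lines are commented out since they apply only to Solaris 10 and later, where SMF manages the daemon:

```shell
# Report whether a process named exactly "cron" is alive
if pgrep -x cron >/dev/null; then
    echo "cron is running"
else
    echo "cron is NOT running"
fi
# On Solaris 10+, SMF can also report and restart it:
# svcs svc:/system/cron:default
# svcadm restart cron
```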