Tuesday, October 17, 2017

Oracle Exadata OneCommand - build virtual server April 2017

Many of you are probably very aware of Exadata at this point.  It's many years into its lifecycle (version 7 just announced), and it's very prevalent in the Oracle database landscape.  I was recently asked to rebuild a few physical nodes of Exadata into virtual nodes.  This is covered under MOS Note 2099488 "Migration of a Bare metal RAC cluster to an OVM RAC cluster on Exadata".

I wouldn't call this the best or worst written MOS note.  It contains four options for doing the rebuild.  No matter which option you’re going to use, be sure to read all four.  Some options outline steps in more detail than others, and much of that background information is important.

Also, the Exadata build or deployment process is all based on OneCommand (or the Oracle Exadata Deployment Assistant / OEDA).  Make sure you go through all of the readme files and documentation for this tool as well.

Overview

Ok, so I'm not going to go through every step here, as there is a lot of background.  But in general, let’s get a quick outline of what we will be doing:

  1. Build a new configuration file from OEDA that represents the new build-out of the Exadata.
  2. Request any network / DNS changes needed to account for your system change (e.g. if you are adding more virtual servers or clusters).  Once those changes are completed, run the checkip script and verify the output is what you expect.
  3. Stage all the needed software, patches, and OneCommand tools for the build.  This list comes from the output of OEDA.  Note: if you are rebuilding servers, be sure to keep copies of all these files off the local storage on your Exadata, such as on an NFS mount or other shared storage that you can easily reach throughout the rebuild process.
  4. Download the server build USB or PXE image files (these are listed in the additional readme for the QFSDP of the version you are installing).  Then stage these files on your PXE boot / NFS server or create a USB thumb drive to boot from.
  5. Clean up the storage cells if needed.  This depends on what you are changing in your system configuration and whether you are keeping or destroying your current databases and data.
  6. Rebuild the database nodes using the images set up in step 4.  This will set up the DOM0 / Oracle VM host on the Exadata.  Be sure to use the serial console through the service processor (ILOM), not the GUI / Java based console, which will not work.
  7. Run the post-build steps of switching to the VM boot image and reclaiming free space by removing the physical Exadata OS image.
  8. Set up SSH equivalency between DB nodes and storage nodes.
  9. Stage the OneCommand utility on the first node, along with the needed patches and software install media.  Be sure to unzip the KLONE gold images from the proper patch zip file.  This is outlined in the OneCommand readme file.
  10. Execute the needed OneCommand steps to build the virtual servers, create the OS users, and set up the CELL connectivity.

From there you can continue on to cluster and database software install and a number of other post Exadata build steps.  There are 17 OneCommand steps in all, and what you will run will depend on your needs and what you are changing.

So, why should I write all this up?  Well, during my latest attempt to do this work I ran into a few issues, and I wanted to expand on those here.  These are not all the issues I hit, but they form one particularly tricky chain of problems that I did not get any help from Oracle support on.

Issues

During step 10 above, I ran into at least three issues.  The OneCommand output was of no help.  While executing OneCommand step 2, "Create Virtual Machine", I received the following message:
"Error running oracle.onecommand.deploy.machines.VmUtils method createVMs"
There was slightly more information than that, but really nothing of value.

Digging through the log output I found a reference to "Unable to locate file" for these two files:
db-klone-Linux-x86-64-12102170418.zip
grid-klone-Linux-x86-64-12102170418.zip

So, these two zip files are inside the patches that OEDA / OneCommand ask you to download in the configuration file.  Buried in the OneCommand readme are the instructions to unzip those two patches prior to running OneCommand.  In my case it was patches 25898234 and 25898235.  I unzipped these two patches and was able to move forward.  Or so I thought.
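A small pre-flight check along these lines can confirm the unzipped files are where OneCommand will look before you start a run.  This is only a sketch: check_klone_images is a hypothetical helper, not part of OneCommand, and it assumes the klone zips sit directly in the WorkDir directory.

```shell
# Hypothetical helper (not part of OneCommand): report which of the
# expected klone image zips exist in the given directory.
# Usage: check_klone_images DIR FILE [FILE...]
# Returns 0 if every file is present, 1 if any is missing.
check_klone_images() {
  workdir="$1"
  shift
  missing=0
  for f in "$@"; do
    if [ -f "$workdir/$f" ]; then
      echo "found:   $f"
    else
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}
```

For the build described here, that would be called as: check_klone_images WorkDir db-klone-Linux-x86-64-12102170418.zip grid-klone-Linux-x86-64-12102170418.zip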

On the next run the log changed.  It still said "Unable to locate file", but the file names were different:
db-klone-Linux-x86-64-12102170814.zip
grid-klone-Linux-x86-64-12102170814.zip

See the issue?  The date stamps differ only by two transposed digits: 170418 in the files from the patches versus 170814 in what OneCommand was looking for.  I couldn't find the second value anywhere in the human-readable OneCommand configuration files (XML or text).
So, I cheated and created symbolic links from the names OneCommand wanted to the files I actually had:
cd WorkDir
ln -s ./db-klone-Linux-x86-64-12102170418.zip ./db-klone-Linux-x86-64-12102170814.zip
ln -s ./grid-klone-Linux-x86-64-12102170418.zip ./grid-klone-Linux-x86-64-12102170814.zip

Now I was feeling confident that run three should just work.  Unfortunately, it did not.
I got the same error message that is of no value:
"Error running oracle.onecommand.deploy.machines.VmUtils method createVMs"
Back to the log file I go.

In the log file, there is a section where the Java routine gets "Exception: null".  Just prior to this exception the application is trying to get a list of system first boot images.  The last line was referencing "System.first.boot.12.2.1.1.1.170419.img.bz2".  Hmm, that is the image file used to build the virtual machine.
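If you need to find that spot in a long log, printing the few lines that lead up to the exception is usually enough.  The sketch below assumes GNU or BSD grep (for the -B option); show_context_before_exception is a hypothetical helper name, and the log file path is whatever your OneCommand run wrote to.

```shell
# Hypothetical helper: print the lines leading up to each
# "Exception: null" entry in a log file.
# Usage: show_context_before_exception LOGFILE [LINES_OF_CONTEXT]
show_context_before_exception() {
  logfile="$1"
  lines="${2:-5}"   # default to 5 lines of leading context
  grep -B "$lines" 'Exception: null' "$logfile"
}
```

The same approach works for the earlier failures by searching for "Unable to locate file" instead.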

Digging into this some, for my build the version of that image we were using is April 2017.  The information above looks right.  I double checked the patch for that image, patch number 25742355.  This information is in the additional readme for the April 2017 QFSDP for Exadata, and is also in the list from the OEDA Installation Template HTML output.

I also verified that the patch was in my WorkDir location and that the zip file was in good shape.  No issues there.
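One way to do that kind of check is with the archive test mode of the standard unzip utility.  A minimal sketch, assuming unzip is installed on the node; zip_is_intact is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the archive passes
# unzip's built-in integrity test (-t), run quietly (-q).
# Usage: zip_is_intact FILE.zip
zip_is_intact() {
  unzip -tq "$1" >/dev/null 2>&1
}
```

For example: zip_is_intact WorkDir/p25742355_122111_Linux-x86-64.zip && echo "zip looks fine".  Running unzip -l against the same file lists the names inside it, which is a quick way to see the exact image file name the patch actually ships.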

Next, I dug into the OneCommand configuration files.  In the properties directory there is an es.properties file that contains all these patch file names and versions.  There is a section that covers the VM first boot images.  Going through the list I found this line:
12.2.1.1.1,System.first.boot.12.2.1.1.1.170323.img.bz2,12.2.1.1.1, \
  p25742355_122111_Linux-x86-64.zip,12.2.1.1.1.170323:\

Ah, well now I see the issue.  Again, a data mismatch between two things inside OneCommand: the log asks for the 170419 image, while the properties file only lists 170323.  Clearly Oracle updated the patch, but didn't update OneCommand in all the right places.  I moved that line out of the list and commented it out.  Then I added the following line in its place:
12.2.1.1.1,System.first.boot.12.2.1.1.1.170419.img.bz2,12.2.1.1.1, \
  p25742355_122111_Linux-x86-64.zip,12.2.1.1.1.170419:\

Be sure to watch where in the list you make edits, and watch the colons and backslashes so that you do not corrupt the data array.
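Before and after an edit like this, it is worth confirming which image names the properties file really lists.  The sketch below is only an illustration: image_listed_in_properties is a hypothetical helper, and PROPS stands for the path to your copy of the OneCommand properties file.

```shell
# Hypothetical helper: succeed if the given image name appears
# literally (grep -F, so dots are not wildcards) in the properties file.
# Usage: image_listed_in_properties PROPS_FILE IMAGE_NAME
image_listed_in_properties() {
  grep -F -q -- "$2" "$1"
}
```

For example: image_listed_in_properties "$PROPS" System.first.boot.12.2.1.1.1.170419.img.bz2 || echo "image not listed".  Taking a copy first (cp -p "$PROPS" "$PROPS.orig") gives you an easy way back if the edit goes wrong.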

Conclusion 

Now the build of the VMs continued normally and I was able to proceed.
What a messy situation where Oracle is just not keeping everything in sync with each other.  It's clear that OneCommand knew what it was looking for in one way (from the log file), but the reality in the configuration files was slightly different.  Seems like a little bit of a house of cards, with too many moving parts. 

I'll leave this story at this point.  Hopefully this helps someone out there who may be running into the same issues with Exadata OneCommand.