Chef restore from backup

So I was testing my restore from backup for Chef and ran into a few problems. The first was that my nginx load balancer config files are generated dynamically based on the roles assigned to boxes. After my restore, one of the first boxes I tested was one of the LB boxes, and to my horror, even though the systems were listed when I did a chef node list, it seems that nodes are not counted until they have checked in to the restored Chef server. This meant my nginx config server pools were empty … bummer. The easy fix was to move my servers over to the restored Chef server instance from the bottom up, i.e. SQL boxes, web boxes, then edge LB stuff. Not a huge problem, but it does mean that if you ever have to restore a Chef box, stop all clients before you bring it up.
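A rough sketch of the bottom-up order, in case it helps. The host names are made up for illustration and are not from my setup; the function only prints the order in which clients should check in, and the loop that actually runs chef-client is left as a comment.

```shell
#!/bin/sh
# Hypothetical host names, grouped by tier; substitute your own.
SQL_HOSTS="sql01 sql02"
WEB_HOSTS="web01 web02"
LB_HOSTS="lb01"

checkin_order() {
    # Backends first, load balancers last, so the LB config sees full pools.
    printf '%s\n' $SQL_HOSTS $WEB_HOSTS $LB_HOSTS
}

# Example (not executed here):
#   for h in $(checkin_order); do ssh "$h" sudo chef-client || break; done
```

The point is only that the LB boxes run last, after everything they depend on has registered with the restored server.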

The other odd problem was that one node with a local variable assigned to it did not pull the variable over. The variable in question had not changed in months, so my daily backups should have contained it. I got lucky: even though it was a DB access password for the system, I had removed the notify restart from a lot of services before the restore to minimize the impact of changes. Overall it went pretty well.

My backups … tar zcvf `date +%Y%m%d`.`hostname`.chef.tar.gz /var/lib/couchdb/ /etc/chef
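If you want that one-liner in a script, something like the following might do. BACKUP_DIR is a hypothetical destination I picked for the example; the tar arguments are the same as above, minus the verbose flag.

```shell
#!/bin/sh
# BACKUP_DIR is an assumed location, not part of the original setup.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/chef}"

backup_name() {
    # Date, then host name, e.g. 20100810.myhost.chef.tar.gz
    echo "$(date +%Y%m%d).$(hostname).chef.tar.gz"
}

backup_chef() {
    mkdir -p "$BACKUP_DIR" || return 1
    # CouchDB data plus the Chef config, as in the one-liner.
    tar zcf "$BACKUP_DIR/$(backup_name)" /var/lib/couchdb/ /etc/chef
}
```

Calling backup_chef from a daily cron job (as root) would give one dated archive per day.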

Restore: build the server, install chef-server, stop chef-server, drop the tar into place, and start chef-server.
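The restore steps could be sketched like this. It assumes chef-server is already installed on the new box and that the server is controlled by an init script at /etc/init.d/chef-server (adjust CHEF_SERVER_INIT if yours differs). By default it only prints the commands; set DRY_RUN=0 to run them for real.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
# Assumption: init script path for the Chef server.
CHEF_SERVER_INIT="${CHEF_SERVER_INIT:-/etc/init.d/chef-server}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restore_chef() {
    archive="$1"
    run "$CHEF_SERVER_INIT" stop || return 1
    # GNU tar strips the leading "/" when archiving, so extract relative to /.
    run tar zxf "$archive" -C / || return 1
    run "$CHEF_SERVER_INIT" start
}
```

Usage would be something like restore_chef /path/to/archive.chef.tar.gz, run as root on the rebuilt server.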

One final thought: I had to restore a 0.8.16 system after 0.9.8 was out, which turned out to be a problem because the latest bootstrap files do not work with 0.8.16. Luckily I had a local copy of the bootstrap that I used for 0.8.x installs and was able to run from there. I suggest you keep local backups of any files you use for installs, just in case.

Chef error: marshal data too short

WARN: HTTP Request Returned 500 Internal Server Error: marshal data too short … what to do?

jmiller@srv-101-29:~$ sudo chef-client
[Tue, 10 Aug 2010 12:36:13 -0700] INFO: Starting Chef Run
[Tue, 10 Aug 2010 12:36:28 -0700] WARN: HTTP Request Returned 500 Internal Server Error: marshal data too short
/usr/lib/ruby/1.8/net/http.rb:2097:in `error!': 500 "Internal Server Error" (Net::HTTPFatalError)
from /usr/lib/ruby/1.8/chef/rest.rb:216:in `api_request'
from /usr/lib/ruby/1.8/chef/rest.rb:267:in `retriable_rest_request'
from /usr/lib/ruby/1.8/chef/rest.rb:197:in `api_request'
from /usr/lib/ruby/1.8/chef/rest.rb:100:in `get_rest'
from /usr/lib/ruby/1.8/chef/client.rb:270:in `sync_cookbooks'
from /usr/lib/ruby/1.8/chef/client.rb:86:in `run'
from /usr/lib/ruby/1.8/chef/application/client.rb:215:in `run_application'
from /usr/lib/ruby/1.8/chef/application/client.rb:207:in `loop'
from /usr/lib/ruby/1.8/chef/application/client.rb:207:in `run_application'
from /usr/lib/ruby/1.8/chef/application.rb:62:in `run'
from /usr/bin/chef-client:25
jmiller@srv-101-29:~$

Looking at this, I thought it was a checksum error on the client and deleted the /var/chef/cache directory, without luck. After digging around I found that stopping the Chef server, deleting /var/chef/cache/checksums, and then restarting the Chef server fixed the problem. Easy fix but odd problem. Chef 0.8.16
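As a sketch, the fix comes down to three commands on the Chef server box. The init script path is an assumption on my part (override CHEF_SERVER_INIT if needed), and by default the script only prints what it would do; set DRY_RUN=0 to actually run it.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
# Assumption: init script path for the Chef server.
CHEF_SERVER_INIT="${CHEF_SERVER_INIT:-/etc/init.d/chef-server}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

fix_checksums() {
    run "$CHEF_SERVER_INIT" stop || return 1
    # Only the checksums subdirectory is removed, not the whole cache.
    run rm -rf /var/chef/cache/checksums || return 1
    run "$CHEF_SERVER_INIT" start
}
```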

MegaCLI Raid6 Array creation

I am using Ubuntu Karmic on a Dell R610 to access MD1200 storage devices, and since (until recently) OpenManage was not an option for the H800 SAS RAID adapters, I had to explore the wonderful megacli utility!

# Find unused disks

root@srv-103-27:/opt/MegaRAID/MegaCli# ./MegaCli64 -PDList -a0 | grep -B14 Unconfigured | grep -e '^Enclosure Device ID:' -e '^Slot Number:'
Enclosure Device ID: 41
Slot Number: 11
Enclosure Device ID: 80
Slot Number: 0
Enclosure Device ID: 80
Slot Number: 1
Enclosure Device ID: 80
Slot Number: 2
Enclosure Device ID: 80
Slot Number: 3
Enclosure Device ID: 80
Slot Number: 4
Enclosure Device ID: 80
Slot Number: 5
Enclosure Device ID: 80
Slot Number: 6
Enclosure Device ID: 80
Slot Number: 7
Enclosure Device ID: 80
Slot Number: 8
Enclosure Device ID: 80
Slot Number: 9
Enclosure Device ID: 80
Slot Number: 10
Enclosure Device ID: 80
Slot Number: 11
Enclosure Device ID: 106
Slot Number: 0
Enclosure Device ID: 106
Slot Number: 1
Enclosure Device ID: 106
Slot Number: 2
Enclosure Device ID: 106
Slot Number: 3
Enclosure Device ID: 106
Slot Number: 4
Enclosure Device ID: 106
Slot Number: 5
Enclosure Device ID: 106
Slot Number: 6
Enclosure Device ID: 106
Slot Number: 7
Enclosure Device ID: 106
Slot Number: 8
Enclosure Device ID: 106
Slot Number: 9
Enclosure Device ID: 106
Slot Number: 10
Enclosure Device ID: 106
Slot Number: 11
root@srv-103-27:/opt/MegaRAID/MegaCli#
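The grep -B14 trick depends on how many lines sit between the IDs and the state line. A slightly more forgiving variant, as a sketch only, is to let awk remember the last enclosure and slot it saw and print them as an enclosure:slot pair when a line containing "Unconfigured" turns up. It assumes the enclosure and slot lines come before the state line in each drive's record, which is what the grep above relies on as well.

```shell
#!/bin/sh
# Reads `MegaCli64 -PDList -a0` output on stdin and prints one
# enclosure:slot pair per unconfigured drive.
unused_drives() {
    awk '
        /^Enclosure Device ID:/ { enc = $NF; seen = 0 }   # new drive record
        /^Slot Number:/         { slot = $NF }
        /Unconfigured/ && !seen { print enc ":" slot; seen = 1 }
    '
}
```

Used as ./MegaCli64 -PDList -a0 | unused_drives, this gives the pairs in the same form that -CfgLdAdd expects.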

# Create Raid 6 Volume

root@srv-103-27:/opt/MegaRAID/MegaCli# ./MegaCli64 -CfgLdAdd -r6 [80:0,80:1,80:2,80:3,80:4,80:5,80:6,80:7,80:8,80:9,80:10] -a0

Adapter 0: Created VD 5

Adapter 0: Configured the Adapter!!

Exit Code: 0x00
root@srv-103-27:/opt/MegaRAID/MegaCli#
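Typing out the bracketed drive list by hand gets old with a full shelf. Here is a small sketch that builds it; both helper names are made up for this example and are not part of MegaCli.

```shell
#!/bin/sh
enclosure_slots() {
    # Print enclosure:slot for every slot from $2 to $3 on enclosure $1.
    enc="$1"; i="$2"; last="$3"
    while [ "$i" -le "$last" ]; do
        echo "$enc:$i"
        i=$((i + 1))
    done
}

drive_list() {
    # Join the arguments with commas and wrap them in brackets.
    list=""
    for d in "$@"; do
        list="${list:+$list,}$d"
    done
    echo "[$list]"
}
```

With these, the create command above could be written as ./MegaCli64 -CfgLdAdd -r6 "$(drive_list $(enclosure_slots 80 0 10))" -a0.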

# Add dedicated hot spares; we use dedicated spares as they stay with the array/shelf
root@srv-103-27:/opt/MegaRAID/MegaCli# ./MegaCli64 -PDHSP -Set -Dedicated -Array5 -PhysDrv [80:11] -a0

Adapter: 0: Set Physical Drive at EnclId-80 SlotId-11 as Hot Spare Success.

Exit Code: 0x00