Chef – couchdb migration

It turned out to be pretty easy to move my Chef data from my old CentOS system to the new Ubuntu host once Chef was installed (see http://mrmiller.nonesensedomains.com/2009/11/18/ubuntu-9-10-karmic-and-chef/ ).

On the CentOS 5.3 host running Chef 0.7.8 and Apache CouchDB 0.9.0:

/etc/init.d/couchdb stop

I then copied /var/lib/couchdb/chef.couch to my admin nfs share, so I could pull it over to the new host.
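Since the whole migration depends on that one file arriving intact, it can be worth checking the copy before moving on. A minimal sketch, assuming a POSIX shell with cksum available; copy_verified is a hypothetical helper name, not something that ships with Chef or CouchDB:

```shell
# Copy a file and confirm the copy has the same CRC and size as the source.
# cksum is a plain CRC: enough to catch a truncated or corrupted transfer,
# but not a cryptographic check.
# Usage: copy_verified SRC DEST
copy_verified() {
    src=$1
    dest=$2
    cp -p "$src" "$dest" || return 1
    src_sum=$(cksum < "$src") || return 1
    dest_sum=$(cksum < "$dest") || return 1
    [ "$src_sum" = "$dest_sum" ]
}
```

With couchdb stopped, something like `copy_verified /var/lib/couchdb/chef.couch /myadminmount/chef.couch` would then return non-zero if the copy on the share does not match.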

On the Ubuntu host running Chef 0.7.14 and Apache CouchDB 0.10.0:

/etc/init.d/couchdb stop
/etc/init.d/chef-server stop
# backup old DB ..
cp /var/lib/couchdb/0.10.0/chef.couch /tmp/
cp /myadminmount/chef.couch /var/lib/couchdb/0.10.0/chef.couch
chown couchdb:couchdb /var/lib/couchdb/0.10.0/chef.couch
/etc/init.d/couchdb start
/etc/init.d/chef-server start

During the move I went from CouchDB 0.9 to 0.10, and based on my reading the DB is updated to the new format by simply running a compact.

curl -X POST http://localhost:5984/chef/_compact
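The compact request returns right away and the work happens in the background. As far as I know, the database info document (a GET on the same database URL) carries a compact_running field, so a small poll loop can tell you when it is done. A rough sketch under that assumption; the function names and the 5 second interval are my own choices:

```shell
# Succeeds if the database info JSON on stdin reports compaction in progress.
compaction_running() {
    grep -q '"compact_running"[[:space:]]*:[[:space:]]*true'
}

# Poll the database info URL until compaction is no longer reported.
# The default URL is an assumption; pass your own host, port and DB name.
# Note: if curl cannot reach the server the loop also ends, so check that
# couchdb is actually up before trusting the result.
wait_for_compaction() {
    db_url=${1:-http://localhost:5984/chef}
    while curl -s "$db_url" | compaction_running; do
        sleep 5
    done
}
```

Running `wait_for_compaction` after the POST above would block until the field flips back to false.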

At this point I just had to copy my cookbooks and site-cookbooks over to the Ubuntu host, which did bring up one problem. The default install of opscode chef does not enable site cookbooks so I had to edit /etc/chef/server.rb and update the cookbook_path and restart chef-server.

#cookbook_path [ "/srv/chef/site-cookbooks", "/srv/chef/cookbooks" ]
cookbook_path [ "/srv/chef/cookbooks" ]

to

cookbook_path [ "/srv/chef/site-cookbooks", "/srv/chef/cookbooks" ]
#cookbook_path [ "/srv/chef/cookbooks" ]

Then:

/etc/init.d/chef-server restart
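If you have to do this on more than one server, the comment swap can be scripted. This is only a sketch: swap_cookbook_path is a made-up helper, and it assumes the two cookbook_path lines in server.rb look exactly like the ones above (same spacing, plain double quotes). It prints the result instead of editing in place, so you can review it first.

```shell
# Print the config with the single-path cookbook_path line commented out
# and the commented two-path line (site-cookbooks first) activated.
# Usage: swap_cookbook_path /path/to/server.rb
swap_cookbook_path() {
    sed \
        -e 's|^cookbook_path \[ "/srv/chef/cookbooks" ]|#&|' \
        -e 's|^#\(cookbook_path \[ "/srv/chef/site-cookbooks", "/srv/chef/cookbooks" ]\)|\1|' \
        "$1"
}
```

For example, `swap_cookbook_path /etc/chef/server.rb > /tmp/server.rb.new`, then diff the two files and copy the new one into place before restarting chef-server.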

When I logged into the interface I saw my roles, and a chef-client run on the system succeeded. WOOT.

I am still having a problem on the other clients, but I just need to figure out what's going on.

/usr/lib/ruby/1.8/net/http.rb:2097:in `error!': 400 "Bad Request" (Net::HTTPServerException)

Favorite command today – disown

Often I start a process and realize it's going to run longer than I really want to keep the session open. If I had known it would run that long I would have used screen or nohup, but now it's too late for that. What's the fix? Simple: background the process and then use the command "disown". Now when you log out the command will finish running. Oh happy day!
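Here is roughly what that looks like, as a sketch that assumes bash (disown is a shell builtin, not a separate program). The sleep is just a stand-in for the real long-running command.

```shell
# Start a long-running command in the background; sleep stands in for the real job.
# In an interactive session where the job is already running in the foreground,
# the equivalent is: press Ctrl-Z to suspend it, then run `bg` to resume it.
sleep 2 &
job_pid=$!

# Remove the job from the shell's job table so the shell does not
# send it SIGHUP when the session ends.
disown "$job_pid"
```

After the disown, `jobs` no longer lists the process, but it still shows up in ps and keeps running.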

Ubuntu 9.10 karmic and Chef

Quick notes on installing Chef configuration management on Ubuntu 9.10 Karmic. This is mostly taken directly from the Chef wiki pages, but I am putting it all together and noting the problems I ran into.

My automated install is a pretty tight server install:

%packages
openssh-server
curl
nfs-common
portmap
libnss-ldap
libpam-ldap
vlan

I want to install the newest version, which is at Opscode, rather than the version in Karmic universe, so I add the Opscode apt repo to the system.

echo "deb http://apt.opscode.com/ karmic universe" > /etc/apt/sources.list.d/opscode.list
curl http://apt.opscode.com/packages@opscode.com.gpg.key | sudo apt-key add -
apt-get update
# actually install chef-server
sudo apt-get install rubygems ohai chef chef-server

I have to manually install git on this server, as it's usually installed by Chef:

sudo apt-get -y install git-core

Now I install apache, and the apache modules

sudo apt-get -y install apache2

# module setup
for a2mod in proxy proxy_http proxy_balancer ssl rewrite headers
do
sudo a2enmod $a2mod
done

Now I create the virtual host:

Create /etc/apache2/sites-available/chef_server.repo with the following info, but replace server_fqdn with your chef fully qualified domain name.

<VirtualHost *:443>
ServerName server_fqdn
DocumentRoot /usr/share/chef-server/public

<Proxy balancer://chef_server>
BalancerMember http://127.0.0.1:4000
Order deny,allow
Allow from all
</Proxy>

LogLevel info
ErrorLog /var/log/apache2/chef_server-error.log
CustomLog /var/log/apache2/chef_server-access.log combined

SSLEngine On
SSLCertificateFile /etc/chef/certificates/server_fqdn.pem
SSLCertificateKeyFile /etc/chef/certificates/server_fqdn.pem

RequestHeader set X_FORWARDED_PROTO 'https'

RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_FILENAME} !-f
RewriteRule ^/(.*)$ balancer://chef_server%{REQUEST_URI} [P,QSA,L]
</VirtualHost>

<VirtualHost *:444>
ServerName server_fqdn
DocumentRoot /usr/share/chef-server/public

<Proxy balancer://chef_server_openid>
BalancerMember http://127.0.0.1:4001
Order deny,allow
Allow from all
</Proxy>

LogLevel info
ErrorLog /var/log/apache2/chef_server-error.log
CustomLog /var/log/apache2/chef_server-access.log combined

SSLEngine On
SSLCertificateFile /etc/chef/certificates/server_fqdn.pem
SSLCertificateKeyFile /etc/chef/certificates/server_fqdn.pem

RequestHeader set X_FORWARDED_PROTO 'https'

RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_FILENAME} !-f
RewriteRule ^/(.*)$ balancer://chef_server_openid%{REQUEST_URI} [P,QSA,L]
</VirtualHost>
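Replacing server_fqdn by hand in six places is easy to get wrong, so one option is to keep the config above as a template and let sed fill it in. A small sketch; render_vhost and the template file name are hypothetical, and it assumes the FQDN contains no slashes or other characters that are special to sed:

```shell
# Print the vhost template with every server_fqdn placeholder replaced.
# Usage: render_vhost TEMPLATE FQDN
render_vhost() {
    template=$1
    fqdn=$2
    sed "s/server_fqdn/$fqdn/g" "$template"
}
```

For example, `render_vhost chef_server.template chef.int.domain > /etc/apache2/sites-available/chef_server.repo` (run as root) would write out the finished vhost file.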

Check out the chef repo:

cd
git clone git://github.com/opscode/chef-repo.git
cd chef-repo

Time to create your ssl cert

rake ssl_cert FQDN=chef.int.domain

I am not sure what I am doing wrong here, but when I run the install it for some reason does not copy over the certs I just generated, so I copy them over manually:

rake install
cd /root/chef-repo/certificates
cp -a * /etc/chef/certificates/

Now we should be ready to restart apache and see if everything is working

sudo /etc/init.d/apache2 restart

We need to enable the chef virtual site

sudo a2ensite chef_server.repo
/etc/init.d/apache2 reload

Now you should be able to bring up the Chef web interface in your browser. If you followed the directions in this writeup it will only work with https.

https://chef.int.domain/

Since I already have an OpenID to LDAP gateway server configured, I am able to log right in and confirm a running install. For more info on that see:

http://mrmiller.nonesensedomains.com/2009/09/18/chef-openid-to-ldap-gateway/

I always like to do a reboot after configuring a new host as even the best make mistakes from time to time and this way I can confirm that everything is starting/running as expected.

Next time I will document my work on migrating the roles and cookbooks from my existing install on CentOS 5.3.

Additional notes:

I update my servers as part of the install, but found out that the couchdb shipped with Karmic at release would not start (my local mirror was out of date). This was fixed by running:

apt-get update
apt-get upgrade -y

I forgot to enable the chef virtual host at first and when I pulled up the URL in my browser got the following error: “SSL received a record that exceeded the maximum permissible length.”. Enabling the site and restarting apache fixed that right up.
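A quick way to catch this before opening the browser is to look for the link that a2ensite creates. A tiny sketch; site_enabled is a made-up helper, and the default directory assumes the stock Debian/Ubuntu Apache layout:

```shell
# Succeeds if the named site has an entry in Apache's sites-enabled directory.
# Usage: site_enabled SITE [ENABLED_DIR]
site_enabled() {
    site=$1
    enabled_dir=${2:-/etc/apache2/sites-enabled}
    [ -e "$enabled_dir/$site" ]
}
```

For example, `site_enabled chef_server.repo || echo "run a2ensite chef_server.repo first"`.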

Ref:

http://wiki.opscode.com/display/chef/Package+Installation+on+Debian+and+Ubuntu

http://wiki.opscode.com/display/chef/How+to+Proxy+Chef+Server+with+Apache

1 Month and 1 7410 less

OK, so my last post was about how much I liked the Sun 7410. This month, forget all that. The last three weeks of my life have been a living hell thanks to Sun and stupid bugs.

The first bug I had the pleasure of hitting was the storage interface module (SIM) load bug. It seems that SIMs that see too much traffic tend to go offline and have to be pulled from the chassis and reseated before they come back online. This sucks, but if that was the worst of it I would have been happy.

Release Note: RN010
Title: J4400 SIM cards fail under load
Platforms: 7410
Related Bug IDs: 6803801

Under heavy load in large configurations, the first SIM card (SIM 0) can fail. The symptoms are a blue LED on the card itself and an audible alarm, with possible alerts in the UI regarding paths and/or power supplies being removed from the chassis. I/O will continue down other available paths, and there is no impact to availability, though performance may suffer. Re-seating the SIM card (removing it and inserting it) should fix the problem. If this problem persists, please contact Sun Support.

The second bug I hit seems to have to do with bad checksums generated by pools created with the Q2 software. I was given bug "6794570 incomplete resilvering after disk replacement" by Sun, but that seems to seriously understate what I faced. After updating to Q3 we went into an endless loop of resilvering. To be fair, in the end Sun also found an undetected SIM error that had us bouncing up and down for over two weeks. It seems that with large pools (ours was 100TB usable, double parity, NSPF) this checksum recalculation is almost guaranteed to fail, as it kicks out drives it detects with checksum errors. At one point it kicked out enough drives to take the whole pool offline in a matter of seconds. Sun was able to reinsert the drives without data loss, but without gold support I would have been SOL.

The third bug I hit was an akd crash, which was really messed up. When akd crashes the second head tries to take over, but akd restarts in the middle of the failure and causes a total hang. In this state NFS is no longer being served because you have a partial failover situation. This was fixed by first shutting down the second head, meaning that when akd died on head 1 it did not try to fail over but just restarted akd, which in turn caused the 20 hour resilver to restart! In the end Sun disabled akd on the one head that was up, which meant no changes could be made for the duration of the resilver. Once the resilver was completed they patched the akd on our system with a backport of Q4 fixes.

The system has been stabilized and we have been running well for about 5 days now, but in the end some errors made during the fix caused a loss of about 90k files and almost three weeks of lost time, as the system was too unstable to run needed operations.

In the end we tried to save money and get through the beta stages with lower-end hardware, and it came back to bite us. It's too bad, because the price point is so good on the Sun, but price is not everything. We have traded in the Sun with another storage vendor (name withheld for now) and are trying to move on with life a little smarter and a little gun-shy.

“These opinions and postings are personal, and do not represent the opinions, positions or views of the Company or other employees of the Company.”