Chef 0.8.0 almost here?

It's starting to feel like 0.8.0 will never ship, and given the lack of documentation I don't feel it's ready to run in production just yet, but I have tasted enough to know I want it.

Been busy as heck around here at Rdio, Inc. Still loving Chef, but I cannot wait for 0.8.0.

Some features I am looking forward to:

Knife: a command-line utility used to interact with a Chef server directly through the RESTful API.
One of the best parts of this that I have seen is that it will make multiple admins much easier to deal with. My favorite command so far: cookbook upload

OpenID no longer the only option for logins: In fact the whole login system has changed, and with knife there will be even less reason than ever to log in to the UI. This is a major change, and the whole auth system is still in flux right now.

Better search: I have not played with this one much, but they say it will be much better, based partially on the data bag addition.

Data bags: Data bags are arbitrary stores of JSON data on the server that get indexed for search.
This will help you store data that is used across recipes with less effort.
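
To make that concrete, here is a minimal sketch of what a data bag item might look like on disk. The file name, the "deploy" id, and every field other than "id" are made up for illustration; the one firm rule is that an item needs an "id" key. Uploading the item to the server is left out here.

```shell
#!/bin/sh
# Hypothetical data bag item: the name "deploy" and all fields except
# "id" are invented for this example.
item_dir=$(mktemp -d)
cat > "$item_dir/deploy.json" <<'EOF'
{
  "id": "deploy",
  "shell": "/bin/bash",
  "groups": ["sysadmin"]
}
EOF

# Succeeds when the file contains an "id" key, which every item needs.
has_id() {
  grep -q '"id"[[:space:]]*:' "$1"
}

if has_id "$item_dir/deploy.json"; then
  echo "item has an id"
fi
```

Any recipe that needs the same user details could then look them up in one place instead of carrying its own copy.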

I am sure there are more, but those are the ones I have played with so far.

Chef – couchdb migration

It turned out to be pretty easy to move my Chef data from my old CentOS system to the new Ubuntu host once Chef was installed (see http://mrmiller.nonesensedomains.com/2009/11/18/ubuntu-9-10-karmic-and-chef/ ).

On the CentOS 5.3 host running Chef: 0.7.8 and couchdb – Apache CouchDB 0.9.0:

/etc/init.d/couchdb stop

I then copied /var/lib/couchdb/chef.couch to my admin nfs share, so I could pull it over to the new host.

On the Ubuntu host running Chef: 0.7.14 and couchdb – Apache CouchDB 0.10.0:

/etc/init.d/couchdb stop
/etc/init.d/chef-server stop
# back up the old DB first
cp /var/lib/couchdb/0.10.0/chef.couch /tmp/
# copy in the DB from the CentOS host
cp /myadminmount/chef.couch /var/lib/couchdb/0.10.0/chef.couch
chown couchdb:couchdb /var/lib/couchdb/0.10.0/chef.couch
/etc/init.d/couchdb start
/etc/init.d/chef-server start

During the move I went from CouchDB 0.9 to 0.10, and based on my reading the DB is updated to the new format by simply running a compaction.

curl -X POST http://localhost:5984/chef/_compact
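
Compaction runs in the background, so the curl call returns right away. A rough sketch of waiting for it to finish is below. It assumes CouchDB answers on localhost:5984, that the database is named chef as in the command above, and that the database info document reports a compact_running field; the function and variable names are my own.

```shell
#!/bin/sh
# Assumed values, matching the compact command above.
COUCH_URL="${COUCH_URL:-http://localhost:5984}"
DB_NAME="${DB_NAME:-chef}"

# Reads a database info document on stdin and succeeds while it
# reports "compact_running":true.
compaction_running() {
  grep -q '"compact_running"[[:space:]]*:[[:space:]]*true'
}

# Poll the database info document until compaction is no longer running.
# Defined but not called here, so nothing talks to a server by accident.
wait_for_compaction() {
  while :; do
    info=$(curl -sf "$COUCH_URL/$DB_NAME") || {
      echo "cannot reach CouchDB at $COUCH_URL" >&2
      return 1
    }
    printf '%s\n' "$info" | compaction_running || break
    sleep 5
  done
  echo "compaction finished"
}
```

Calling wait_for_compaction right after the compact request keeps chef-server from being hammered by clients while the database is still being rewritten.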

At this point I just had to copy my cookbooks and site-cookbooks over to the Ubuntu host, which did bring up one problem. The default Opscode Chef install does not enable site-cookbooks, so I had to edit /etc/chef/server.rb, update the cookbook_path, and restart chef-server.

#cookbook_path [ "/srv/chef/site-cookbooks", "/srv/chef/cookbooks" ]
cookbook_path [ "/srv/chef/cookbooks" ]

to

cookbook_path [ "/srv/chef/site-cookbooks", "/srv/chef/cookbooks" ]
#cookbook_path [ "/srv/chef/cookbooks" ]

Then:

/etc/init.d/chef-server restart

I logged into the interface, saw my roles, and did a successful chef-client run on the system. WOOT.

I am still having a problem on the other clients, but I just need to figure out what's going on.

/usr/lib/ruby/1.8/net/http.rb:2097:in `error!': 400 "Bad Request" (Net::HTTPServerException)

Favorite command today – disown

Often I start a process and realize it's going to run longer than I really want to keep the session open. If I had known it would run that long I would have used screen or nohup, but now it's too late for that. What's the fix? Simply background the process and then use the command “disown”. Now when you log out the command will keep running until it finishes. O happy day!
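
A minimal sketch of the same idea in script form, assuming bash (disown is a shell builtin, not a separate program) and using sleep as a stand-in for the real long-running job. In an interactive session with the job already in the foreground, you would press Ctrl-Z and run bg first, then disown.

```shell
#!/usr/bin/env bash
# Placeholder for a long-running command, started in the background.
sleep 30 &
job_pid=$!

# Drop the job from this shell's job table so the shell will not
# pass a hangup on to it when the session ends.
disown "$job_pid"

# The job table no longer lists it, but the process is still alive.
jobs
kill -0 "$job_pid" && echo "still running as PID $job_pid"
```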

Ubuntu 9.10 karmic and Chef

Quick notes on installing Chef configuration management on Ubuntu 9.10 Karmic. This is mostly taken directly from the Chef wiki pages, but it puts everything together and notes the problems I ran into.

My automated install is a pretty tight server install:

%packages
openssh-server
curl
nfs-common
portmap
libnss-ldap
libpam-ldap
vlan

I want the newest version, which is in the Opscode repository, rather than the version in Karmic universe, so I add the Opscode apt repo to the system.

echo "deb http://apt.opscode.com/ karmic universe" > /etc/apt/sources.list.d/opscode.list
curl http://apt.opscode.com/packages@opscode.com.gpg.key | sudo apt-key add -
apt-get update
# actually install chef-server
sudo apt-get install rubygems ohai chef chef-server

I have to manually install git on this server, as it's usually installed by Chef.

sudo apt-get -y install git-core

Now I install apache, and the apache modules

sudo apt-get -y install apache2

# module setup
for a2mod in proxy proxy_http proxy_balancer ssl rewrite headers
do
sudo a2enmod $a2mod
done

Now I create the virtual host:

Create /etc/apache2/sites-available/chef_server.repo with the following contents, replacing server_fqdn with your Chef server's fully qualified domain name.

<VirtualHost *:443>
ServerName server_fqdn
DocumentRoot /usr/share/chef-server/public

<Proxy balancer://chef_server>
BalancerMember http://127.0.0.1:4000
Order deny,allow
Allow from all
</Proxy>

LogLevel info
ErrorLog /var/log/apache2/chef_server-error.log
CustomLog /var/log/apache2/chef_server-access.log combined

SSLEngine On
SSLCertificateFile /etc/chef/certificates/server_fqdn.pem
SSLCertificateKeyFile /etc/chef/certificates/server_fqdn.pem

RequestHeader set X_FORWARDED_PROTO 'https'

RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_FILENAME} !-f
RewriteRule ^/(.*)$ balancer://chef_server%{REQUEST_URI} [P,QSA,L]
</VirtualHost>

<VirtualHost *:444>
ServerName server_fqdn
DocumentRoot /usr/share/chef-server/public

<Proxy balancer://chef_server_openid>
BalancerMember http://127.0.0.1:4001
Order deny,allow
Allow from all
</Proxy>

LogLevel info
ErrorLog /var/log/apache2/chef_server-error.log
CustomLog /var/log/apache2/chef_server-access.log combined

SSLEngine On
SSLCertificateFile /etc/chef/certificates/server_fqdn.pem
SSLCertificateKeyFile /etc/chef/certificates/server_fqdn.pem

RequestHeader set X_FORWARDED_PROTO 'https'

RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_FILENAME} !-f
RewriteRule ^/(.*)$ balancer://chef_server_openid%{REQUEST_URI} [P,QSA,L]
</VirtualHost>

Check out the chef repo:

cd
git clone git://github.com/opscode/chef-repo.git
cd chef-repo

Time to create your ssl cert

rake ssl_cert FQDN=chef.int.domain

Not sure what I am doing wrong here, but the install I run next for some reason does not copy over the certs I just generated, so I copy them manually.

rake install
cd /root/chef-repo/certificates
cp -a * /etc/chef/certificates/

Now we should be ready to restart apache and see if everything is working

sudo /etc/init.d/apache2 restart

We need to enable the chef virtual site

sudo a2ensite chef_server.repo
/etc/init.d/apache2 reload

Now you should be able to bring up the Chef web interface in your browser. If you followed the directions in this writeup it will only work with https.

https://chef.int.domain/

Since I already have an OpenID-to-LDAP gateway server configured, I am able to log right in and confirm a running install. For more info on that see:

http://mrmiller.nonesensedomains.com/2009/09/18/chef-openid-to-ldap-gateway/

I always like to do a reboot after configuring a new host as even the best make mistakes from time to time and this way I can confirm that everything is starting/running as expected.

Next time I will document my work on migrating the roles and cookbooks from my existing install on CentOS 5.3.

Additional notes:

I update my servers as part of the install, but found out that the couchdb that shipped with Karmic at release would not start (my local mirror was out of date). This was fixed by running

apt-get update
apt-get upgrade -y

I forgot to enable the chef virtual host at first, and when I pulled up the URL in my browser I got the following error: “SSL received a record that exceeded the maximum permissible length.” Enabling the site and restarting apache fixed that right up.
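
A quick way to catch this before opening the browser is to check whether the site has been linked into sites-enabled. This is a small sketch assuming the stock Debian/Ubuntu Apache layout, where a2ensite creates a link under /etc/apache2/sites-enabled; the site_enabled helper and the SITES_ENABLED variable are names I made up.

```shell
#!/bin/sh
# Assumed Debian/Ubuntu layout; override SITES_ENABLED if yours differs.
SITES_ENABLED="${SITES_ENABLED:-/etc/apache2/sites-enabled}"

# Succeeds when the named site is present in sites-enabled.
site_enabled() {
  [ -e "$SITES_ENABLED/$1" ]
}

if site_enabled chef_server.repo; then
  echo "chef_server.repo is enabled"
else
  echo "chef_server.repo is not enabled; run: sudo a2ensite chef_server.repo"
fi
```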

Ref:

http://wiki.opscode.com/display/chef/Package+Installation+on+Debian+and+Ubuntu

http://wiki.opscode.com/display/chef/How+to+Proxy+Chef+Server+with+Apache

1 Month and 1 7410 less

OK so last post was about how much I liked the Sun 7410, this month forget all that. The last three weeks of my life have been a living hell thanks to Sun and stupid bugs.

The first bug I had the pleasure of hitting was the storage interface module (SIM) load bug. It seems that SIMs that see too much traffic tend to go offline and have to be pulled from the chassis and reseated before they come back online. This sucks, but if that was the worst of it I would have been happy.

Release Note: RN010
Title: J4400 SIM cards fail under load
Platforms: 7410
Related Bug IDs: 6803801

Under heavy load in large configurations, the first SIM card (SIM 0) can fail. The symptoms are a blue LED on the card itself and an audible alarm, with possible alerts in the UI regarding paths and/or power supplies being removed from the chassis. I/O will continue down other available paths, and there is no impact to availability, though performance may suffer. Re-seating the SIM card (removing it and inserting it) should fix the problem. If this problem persists, please contact Sun Support.

The second bug I hit seems to have to do with a bad checksum generated by pools created with Q2 software. Sun gave me bug “6794570 incomplete resilvering after disk replacement”, but that seems to seriously understate what I faced. After updating to Q3 we went into an endless loop of resilvering. To be fair, in the end Sun also found an undetected SIM error that had us bouncing up and down for over two weeks. It seems that with large pools (ours was 100TB usable, double parity, NSPF) this checksum recalculation is almost guaranteed to fail, as it kicks out drives it detects with checksum errors. At one point it kicked out enough drives to take the whole pool offline in a matter of seconds. Sun was able to reinsert the drives without data loss, but without gold support I would have been SOL.

The third bug I hit was an akd crash, which was really messed up. When akd crashes the second head tries to take over, but akd restarts in the middle of the failure and causes a total hang. In this state NFS is no longer being served because you have a partial failover situation. This was fixed by first shutting down the second head, meaning that when akd died on head 1 it did not try to fail over but just restarted akd, which in turn caused the 20 hour resilver to restart! In the end Sun disabled akd on the one head that was up, which meant no changes could be made for the duration of the resilver. Once the resilver was completed they patched akd on our system with a backport of the Q4 fixes.

The system has been stabilized and we have been running well for about 5 days now, but in the end some errors made during the fix caused a loss of about 90k files and almost three weeks of lost time, as the system was too unstable to run needed operations.

In the end we tried to save money and get through the beta stages with lower end hardware, and it came back to bite us. It's too bad, because the price point is so good on the Sun, but price is not everything. We have traded in the Sun with another storage vendor (name withheld for now) and are trying to move on with life a little smarter and a little gun-shy.

“These opinions and postings are personal, and do not represent the opinions, positions or views of the Company or other employees of the Company.”

Sun 7410 Cluster

I highly suggest the Sun 7410 for anyone needing aux storage at a great price.

So last week I installed a new Sun 7410 cluster into the data center. Let me just start out with how much I love this thing! This is my second time purchasing the 7410, but this time I took the route of self install, which I highly suggest. With my first cluster purchase, while I was at Tagged, Inc, I had Sun Professional Services come in and do the install, which turned out to be a real pain in the rear.

The 7410 cluster install with 6 shelves took 10 hours with a manual screwdriver, and at least part of that was due to my misunderstanding of the docs. When it says you have to ssh to the IP you configured on the console, it means it 🙂 . I had made the mistake of assuming that since I set up the out-of-band management IP via the console it would then drop me into the head controller config. I have a few gripes I will write up later, but given the price deal I think they are things I can live with.

Chef openid to ldap gateway

Setting up an OpenID to LDAP gateway for Chef authentication.

One of my main complaints about Chef, and I might mention the office joke, is the use of OpenID for authentication of users. This presented two problems for me: first, I would never trust the authentication for my management server to an outside source, and second, my Chef server does not have internet access. Chef pointed me in the direction of http://www.openid-ldap.org/ and after a little wrestling I had a working internal OpenID auth system using my already existing LDAP auth system.

I am running openidldap on a system that I have configured to handle admin web apps, and the install consisted of simply creating the web root and updating ldap.php. One thing I did find useful was to rename the directory to openid. The ldap.php was pretty easy, but one place I did get stuck was that I did not clearly read the directions and tried to create .htaccess files rather than just updating /etc/httpd/conf.d/ssl.conf and /etc/httpd/conf/httpd.conf like they said.

If you follow the directions it should be a 10 minute setup at most.

Untar the file in your webroot and rename the directory to openid.

Append to httpd.conf, or virtualhost.conf if you're using one:


---
RewriteEngine On

RewriteRule ^/openid$ https://openid.int.mycompany/openid/ [R=permanent,L]
RewriteRule ^/openid/$ https://openid.int.mycompany/openid/ [R=permanent,L]
RewriteRule ^/openid/(.*)$ https://openid.int.mycompany/openid/$1 [R=permanent,L]
---

Insert inside the virtualhost of ssl.conf:


---
SSLProxyEngine On
RewriteEngine On

RewriteCond %{REQUEST_URI} !^/(.+)\.php(.*)$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /openid/([A-Za-z0-9]+)\?(.*)\ HTTP/
RewriteRule ^/openid/(.*)$ https://openid.int.mycompany/openid/index.php?user=%1&%2 [P]

RewriteCond %{REQUEST_URI} !^/(.+)\.php(.*)$
RewriteRule ^/openid/([A-Za-z0-9]+)$ https://openid.int.mycompany/openid/index.php?user=$1 [P]
---

Update the ldap.php in the openid directory you just created. It is pretty clear, but I did have to edit the following lines to make sure the name showed up correctly.


---
# SREG names matching to LDAP attribute names
'nickname' => 'uid',
'email' => 'mail',
'fullname' => 'cn',
---

Then simply test by going to https://yourhostname/openid/

One thing I have yet to fix is that my Chef server straddles two networks, one side that I can access from the office and the other that the servers talk on, and this creates havoc on my logins. For now I wound up creating an openid.int.mycompany entry pointing to the IP visible to my Mac, and that gets me around the problem of int.mycompany not being routable outside the server network.

Bug in chef::client recipe on CentOS 5 ( or not )

Now that I have fixed things so I can run the chef::client recipe, a new error has cropped up. Here is the error and how I fixed it.

[Tue, 15 Sep 2009 22:06:36 -0700] INFO: Creating a symbolic link from -> /etc/init.d/chef-client for link[/etc/init.d/chef-client]
[Tue, 15 Sep 2009 22:06:36 -0700] INFO: Creating a symbolic link from /chef-client -> /chef-client for link[/chef-client]
[Tue, 15 Sep 2009 22:06:38 -0700] INFO: Chef Run complete in 3.967334 seconds
[root@srv-101-25 ~]# chef-client
/usr/lib/ruby/1.8/net/http.rb:560:in `initialize': getaddrinfo: Name or service not known (SocketError)
from /usr/lib/ruby/1.8/net/http.rb:560:in `open'
from /usr/lib/ruby/1.8/net/http.rb:560:in `connect'
from /usr/lib/ruby/1.8/timeout.rb:53:in `timeout'
from /usr/lib/ruby/1.8/timeout.rb:93:in `timeout'
from /usr/lib/ruby/1.8/net/http.rb:560:in `connect'
from /usr/lib/ruby/1.8/net/http.rb:553:in `do_start'
from /usr/lib/ruby/1.8/net/http.rb:542:in `start'
from /usr/lib/ruby/1.8/net/http.rb:1035:in `request'
... 7 levels...
from /usr/lib/ruby/gems/1.8/gems/chef-0.7.8/lib/chef/application.rb:57:in `run'
from /usr/lib/ruby/gems/1.8/gems/chef-0.7.8/bin/chef-client:26
from /usr/bin/chef-client:19:in `load'
from /usr/bin/chef-client:19

Update 1:
Found the problem: when it installs the client it generates a new /etc/chef/client.rb file that uses the local server name as the default value for the Chef server, which is of course a problem.

[root@srv-101-10 ~]# vi /etc/chef/client.rb
---
# Chef Client Config File
#
# Dynamically generated by Chef - local modifications will be replaced
#

log_level :info
log_location "/var/log/chef/client.log"
ssl_verify_mode :verify_none
registration_url "https://srv-101-10.int.mycompany.int.mycompany"
openid_url "https://srv-101-10.int.mycompany.int.mycompany:444"
template_url "https://srv-101-10.int.mycompany.int.mycompany"
remotefile_url "https://srv-101-10.int.mycompany.int.mycompany"
search_url "https://srv-101-10.int.mycompany.int.mycompany"
role_url "https://srv-101-10.int.mycompany.int.mycompany"

file_store_path "/srv/chef/file_store"
file_cache_path "/srv/chef/cache"

pid_file "/var/run/chef/chef-client.pid"

Chef::Log::Formatter.show_time = true
---

whereas it should read:

---
#
# Chef Client Config File
#
# Dynamically generated by Chef - local modifications will be replaced
#

log_level :info
log_location "/var/log/chef/client.log"
ssl_verify_mode :verify_none
registration_url "https://chef.int.mycompany"
openid_url "https://chef.int.mycompany:444"
template_url "https://chef.int.mycompany"
remotefile_url "https://chef.int.mycompany"
search_url "https://chef.int.mycompany"
role_url "https://chef.int.mycompany"

file_store_path "/srv/chef/file_store"
file_cache_path "/srv/chef/cache"

pid_file "/var/run/chef/chef-client.pid"

Chef::Log::Formatter.show_time = true
---
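
A small check along these lines would have caught the bad file sooner. It is only a sketch: the wrong_urls function is a made-up helper, it assumes the settings look like the ones above (one *_url "https://host[:port]" per line), and the sample file is written to a temp path rather than touching /etc/chef/client.rb.

```shell
#!/bin/sh
# Hypothetical helper: print every *_url line in config file $1 whose host
# is not exactly $2 (optionally followed by a port). Succeeds if any remain.
wrong_urls() {
  grep -E '^[a-z_]+_url[[:space:]]' "$1" |
    grep -v -F "://$2\"" |
    grep -v -F "://$2:"
}

# Sample settings in the style of the files above, kept in a temp file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
registration_url "https://srv-101-10.int.mycompany.int.mycompany"
openid_url "https://chef.int.mycompany:444"
search_url "https://chef.int.mycompany"
EOF

if wrong_urls "$sample" chef.int.mycompany; then
  echo "client.rb points somewhere other than the Chef server" >&2
fi
```

Pointing the function at the real client.rb after a chef-client run shows whether the recipe has rewritten the URLs again.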

Now I am off to find the recipe causing the problem.

The fix turned out to be updating the JSON override properties for my COMPANY_BASE_ROLE:


---
{
"defaults":{},
"overrides":{
"chef":{
"client_splay":"20",
"client_interval":"600",
"server_fqdn":"chef.int.company"
},
"authorization":{
"sudo":{
"groups":["group1"
],
"users":[]
}
},
"ntp":{
"is_server":false,
"service":"ntpd",
"servers":["time01.int.company",
"time02.int.company"
]
}
}
}
---

Chef for fun and maybe some profit

Upon joining my new company last month I came into the perfect environment of empty servers and all the freedom I wanted. I had been testing Cobbler https://fedorahosted.org/cobbler/ and Chef http://wiki.opscode.com/display/chef/Home the previous month at Tagged, Inc as a replacement for my home-grown build system that had been implemented there and at Pay By Touch. Well, testing on virtual systems did not truly expose me to the joys of deploying Chef in a closed environment. As any security-minded person would do, I designed the new environment to not allow reaching outside of the local network for anything. Chef did not like that, but it turned out to be OK after lots and lots of fun.

My cobbler server is providing a local mirror of http://elff.bravenet.com/, and I have pulled down the current bootstrap file to my cobbler system Apache server.

In the package section of my company_base.ks file, I include

rubygem-chef

Based on notes I found for a Puppet install, I created snippets in my cobbler install.

To prep for the install of the chef client, this one is run before the %post section in my company_base.ks file:
$SNIPPET('company_chef_nochroot')

[jmiller@cobbler ~]$ cat /var/lib/cobbler/snippets/company_chef_nochroot

# Make sure we have network stuff in place so when we register with the server all is well

%post --nochroot
# Copy netinfo, which has our FQDN from DHCP, into the chroot
test -f /tmp/netinfo && cp /tmp/netinfo /mnt/sysimage/tmp/

This snippet in my company_base.ks file installs, validates, and does the first run of the chef client:
$SNIPPET('company_chef_client')

[jmiller@cobbler ~]$ cat /var/lib/cobbler/snippets/company_chef_client

# In this script we actually install the client

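# chef-solo config used only for the bootstrap run
cat <<EOF > /root/solo.rb
file_cache_path "/tmp/chef-solo"
cookbook_path "/tmp/chef-solo/cookbooks"
EOF

# attributes for the bootstrap run
cat <<EOF > /root/chef.json
{
"chef": {
"server_fqdn": "chef.int.company"
},
"packages": {
"dist_only": true
},
"recipes": "chef::client"
}
EOF

# run list handed to chef-client on registration
cat <<EOF > /root/client.json
{
"run_list": ["role[COMPANY_BASE]"]
}
EOF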

# Configure the Env
echo "Installing Chef Bootstrap"
cd /root/
chef-solo -c solo.rb -j chef.json -r http://chef.int.company/bootstrap-0.7.8.tar.gz
cd -

# register with the server
echo "Register with Chef"
chef-client -t "myAuthToken" -j /root/client.json

chef-client
[jmiller@cobbler ~]$

My COMPANY_BASE role in chef was lacking a few recipes and threw me for a huge loop.

recipes in COMPANY_BASE ( chef chef::client sudo screen ntp openssh snmp git )


[root@srv-101-25 ~]# chef-client -l debug -j /root/client.json
/usr/lib/ruby/gems/1.8/gems/chef-0.7.8/lib/chef/recipe.rb:200:in `method_missing': Cannot find Chef::Resource::DistOnly? for dist_only? (NameError)
Original: undefined method `DistOnly?' for Chef::Resource:Class

After a lot of troubleshooting with Joshua Timberman of Opscode, we found out that I needed two more recipes (packages & runit). It turns out this is a limitation with RPM-based systems, and it caused me a lot of hurt.

Sites I owe a lot of thanks to:
http://wiki.opscode.com/display/chef/Home
https://fedorahosted.org/cobbler/
http://reductivelabs.com/trac/puppet/wiki/BootstrappingWithPuppet
http://wiki.opscode.com/display/chef/Installation+on+RHEL+and+CentOS+5+with+RPMs