1 Month and 1 7410 less

OK, so my last post was about how much I liked the Sun 7410. This month, forget all that. The last three weeks of my life have been a living hell thanks to Sun and stupid bugs.

The first bug I had the pleasure of hitting was the storage interface module (SIM) load bug. It seems that SIMs that see too much traffic tend to go offline and have to be pulled from the chassis and reseated before they come back online. This sucks, but if that was the worst of it I would have been happy.

Release Note: RN010
Title: J4400 SIM cards fail under load
Platforms: 7410
Related Bug IDs: 6803801

Under heavy load in large configurations, the first SIM card (SIM 0) can fail. The symptoms are a blue LED on the card itself and an audible alarm, with possible alerts in the UI regarding paths and/or power supplies being removed from the chassis. I/O will continue down other available paths, and there is no impact to availability, though performance may suffer. Re-seating the SIM card (removing it and inserting it) should fix the problem. If this problem persists, please contact Sun Support.

The second bug I hit seems to have to do with a bad checksum generated by pools created with Q2 software. I was given bug “6794570 incomplete resilvering after disk replacement” by Sun, but that seems to seriously understate what I faced. After updating to Q3 we went into an endless loop of resilvering. Now, to be fair, in the end Sun also found an undetected SIM error that had us bouncing up and down for over two weeks. It seems that with large pools (ours was 100TB usable, double parity, NSPF) this checksum recalculation is almost guaranteed to fail, as it kicks out drives it detects with checksum errors. At one point it kicked out enough drives to take the whole pool offline in a matter of seconds. Sun was able to reinsert the drives without data loss, but without gold support I would have been SOL.
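A rough sketch of why a handful of kicked-out drives can take down the whole pool: it assumes a pool striped across double-parity groups, where the pool stays online only while every group has lost no more than two drives. The group layout below is hypothetical; the post does not give the real one.

```python
# Hypothetical layout: the real group widths and counts are not given in the post.
PARITY = 2  # double parity: each group survives at most two missing drives


def pool_online(faulted_per_group, parity=PARITY):
    """Return True while every redundancy group has lost no more drives than its parity level."""
    return all(faulted <= parity for faulted in faulted_per_group)


# Four made-up groups; each number is how many drives were kicked out of that group.
print(pool_online([2, 1, 0, 2]))  # every group is still within tolerance
print(pool_online([2, 3, 0, 0]))  # one group lost three drives
```

Under this assumption, the total number of missing drives matters less than where they land: a third drive dropped from any single group is enough to fault the pool.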

The third bug I hit was an akd crash, which was really messed up. When akd crashes, the second head tries to take over, but akd restarts in the middle of the failover and causes a total hang. In this state NFS is no longer being served, because you have a partial failover situation. This was fixed by first shutting down the second head, meaning that when akd died on head 1 it did not try to fail over but just restarted akd, which in turn caused the 20 hour resilver to restart! In the end, Sun disabled akd on the one head that was up for the duration of the resilver, which meant no changes could be made. Once the resilver was completed, they patched akd on our system with a backport of Q4 fixes.

The system has been stabilized and we have been running well for about 5 days now, but in the end some errors made during the fix caused a loss of about 90k files and almost three weeks of lost time, as the system was too unstable to run needed operations.

In the end we tried to save money and get through the beta stages with lower-end hardware, and it came back to bite us. It's too bad, because the price point is so good on the Sun, but price is not everything. We have traded in the Sun with another storage vendor (name withheld for now) and are trying to move on with life a little smarter and a little gun-shy.

“These opinions and postings are personal, and do not represent the opinions, positions or views of the Company or other employees of the Company.”

Sun 7410 Cluster

I highly suggest the Sun 7410 for anyone needing aux storage at a great price.

So last week I installed a new Sun 7410 cluster into the data center. Let me just start out with how much I love this thing! That said, this is my second time purchasing the 7410, but this time I took the self-install route, which I highly suggest. With my first cluster purchase, while I was at Tagged, Inc, I had Sun Professional Services come in and do the install, which turned out to be a real pain in the rear.

The 7410 cluster install with 6 shelves took 10 hours with a manual screwdriver, and at least part of that was due to my misunderstanding of the docs. When it says you have to ssh to the IP you configured on the console, it means it 🙂 . I had made the mistake of assuming that since I set up the out-of-band management IP via the console, it would then drop me into the head controller config. I have a few gripes I will write up later, but given the price deal I think they are things I can live with.