OK, so my last post was about how much I liked the Sun 7410. This month, forget all that. The last three weeks of my life have been a living hell thanks to Sun and some stupid bugs.
The first bug I had the pleasure of hitting was the storage interface module (SIM) load bug. It seems that SIMs that see too much traffic tend to go offline and have to be pulled from the chassis and reseated before they come back online. This sucks, but if that were the worst of it I would have been happy.
Release Note RN010
Title J4400 SIM cards fail under load
Related Bug IDs 6803801
Under heavy load in large configurations, the first SIM card (SIM 0) can fail. The symptoms are a blue LED on the card itself and an audible alarm, with possible alerts in the UI regarding paths and/or power supplies being removed from the chassis. I/O will continue down other available paths, and there is no impact to availability, though performance may suffer. Re-seating the SIM card (removing it and inserting it) should fix the problem. If this problem persists, please contact Sun Support.
The second bug I hit seems to have to do with bad checksums generated by pools created with the Q2 software. Sun pointed me to bug “6794570 incomplete resilvering after disk replacement”, but that seriously understates what I faced. After updating to Q3 we went into an endless loop of resilvering. Now, to be fair, in the end Sun also found an undetected SIM error that had us bouncing up and down for over two weeks. It seems that with large pools (ours was 100TB usable, double parity, NSPF) this checksum recalculation is almost guaranteed to fail, because it kicks out any drive it detects with checksum errors. At one point it kicked out enough drives to take the whole pool offline in a matter of seconds. Sun was able to reinsert the drives without data loss, but without gold support I would have been SOL.
The third bug I hit was an akd crash, which was really messed up. When akd crashes, the second head tries to take over, but akd restarts in the middle of the failover and causes a total hang. In this state NFS is no longer being served, because you are stuck in a partial failover. The first fix was to shut down the second head, so that when akd died on head 1 it did not try to fail over but just restarted akd, which in turn caused the 20 hour resilver to restart! In the end Sun disabled akd on the one head that was up for the duration of the resilver, which meant no changes could be made during that time. Once the resilver completed, they patched akd on our system with a backport of the Q4 fixes.
The system has been stabilized and we have been running well for about 5 days now, but in the end some errors made during the fix caused the loss of about 90k files, plus almost three weeks of lost time, as the system was too unstable to run the operations we needed.
In the end we tried to save money and get through the beta stages with lower end hardware, and it came back to bite us. It's too bad, because the price point is so good on the Sun, but price is not everything. We have traded in the Sun with another storage vendor (name withheld for now) and are trying to move on with life a little smarter and a little gun-shy.
“These opinions and postings are personal, and do not represent the opinions, positions or views of the Company or other employees of the Company.”