practice failure recovery
2011-04-26 19:11:55 GMT
In my test cluster I managed to jam up a Cassandra server. I figure the easy & failsafe solution is to just boot a replacement node, but I thought I'd first spend a minute trying to figure out either what I did or how to properly recover the node before I lose my current state.
The symptom = on startup I get an exception:
ERROR 11:58:34,567 Exception encountered during startup.
Where things went wrong = I had been doing various testing and unit testing, as this is my "proof of concept" cluster. The unit tests in particular work by cloning a keyspace as "keyspace_UUID" (to get a blank slate). Because of various bugs in my code and configuration, this left a fair number of crud keyspaces behind by the time I got everything to pass.

So, I wrote a script to drop all of the test keyspaces (the script had worked in a single-node environment, which was my first step before the cluster). I think the CLI doesn't wait for schema propagation, so the script confused the node I was talking to: after it ran, the schema UUID of that node didn't agree with the rest of the cluster ("describe cluster" in the CLI). And it wasn't fixing itself.

"nodetool loadbalance" said it would do a decommission/bootstrap, which I thought might give the bad node a kick in the pants, so I tried it. Afterwards, I ran "nodetool ring" against all nodes. The problem node claimed everything was "UP", but every other node listed the problem node as "?" and the rest as UP (sadly, I either didn't check or can't remember what "nodetool ring" said before loadbalance).

So, I shut down the problem node. But when I tried to restart it, I got the error you see above.
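In hindsight, the cleanup script probably should have paused between drops until every node reported the same schema version. Here is a rough sketch of that idea, not what I actually ran. The names is_test_keyspace, wait_for_agreement and drop_test_keyspaces are made up for illustration. The drop_keyspace and schema_versions callables are placeholders for whatever your client offers; I'm assuming schema_versions returns a mapping of schema version to the hosts reporting it (the same information "describe cluster" shows). I'm also assuming the test keyspaces end in a standard hyphenated UUID.

```python
import re
import time

# Assumption: test keyspaces are named "<base>_<UUID>" with a hyphenated UUID.
_TEST_KS = re.compile(
    r".+_[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def is_test_keyspace(name):
    """True if the keyspace name looks like a cloned test keyspace."""
    return _TEST_KS.fullmatch(name) is not None


def wait_for_agreement(schema_versions, timeout=30.0, poll=0.5, sleep=time.sleep):
    """Poll until all hosts report a single schema version.

    schema_versions is a hypothetical callable returning
    {schema_version: [hosts reporting it]}.
    Returns True on agreement, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if len(schema_versions()) == 1:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll)


def drop_test_keyspaces(names, drop_keyspace, schema_versions,
                        timeout=30.0, sleep=time.sleep):
    """Drop test keyspaces one at a time, never while the schema disagrees.

    drop_keyspace is a hypothetical callable taking a keyspace name.
    Raises RuntimeError instead of piling more changes onto a split schema.
    """
    dropped = []
    for name in names:
        if not is_test_keyspace(name):
            continue  # leave real keyspaces alone
        # Wait for the previous change (if any) to reach every node first.
        if not wait_for_agreement(schema_versions, timeout, sleep=sleep):
            raise RuntimeError("schema disagreement before dropping " + name)
        drop_keyspace(name)
        dropped.append(name)
    # Make sure the last drop has propagated too.
    if dropped and not wait_for_agreement(schema_versions, timeout, sleep=sleep):
        raise RuntimeError("schema disagreement after dropping " + dropped[-1])
    return dropped
```

The point of stopping with an error rather than carrying on is that, once the versions disagree, each further drop only adds to the mess that has to be untangled later.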
Not sure what was the worst/dumbest thing I did, but it's definitely unhappy now!