16th
Hey, how’s dumbo running an M/R combine? Uh, well…
So, debugging sluggish Hadoop M/R jobs is like a whole new world of things I’m trying to learn. Which, y’know, good and all, but, whoa, there are some challenges.
Anyways, here are interesting things I learned today:
- For procps-based top, the %st CPU state means “stolen from the virtual machine”
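Which is to say, the Xen hypervisor is handing some of your cycles to somebody else: your virtual CPU was ready to run, but you’re only getting a slice of the physical machine and the hypervisor was busy with another guest. As far as I can tell, time you spend idle isn’t supposed to count as stolen, though I think how faithfully it gets reported depends on how accurate your copy of top is.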
You know what else is fun: the man page for top didn’t admit there was such a thing as an %st state. Good times.
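If you don’t trust your copy of top, here’s a rough sketch of reading the number yourself. It assumes a Linux box where the aggregate `cpu` line in /proc/stat carries the steal counter as its eighth value; the function names are my own invention, not from any library.

```python
import time


def parse_cpu_line(line):
    """Return the integer tick counters from the aggregate 'cpu' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in parts[1:]]


def steal_percent(before, after):
    """Percentage of elapsed ticks that were stolen between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    if len(deltas) < 8:
        return 0.0  # kernel too old to report a steal counter
    # Sum only user..steal; guest time is already counted inside user/nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def sample_steal(interval=1.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return the steal %."""
    with open(path) as f:
        first = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.readline())
    return steal_percent(first, second)
```

Calling `sample_steal()` on the data node while a job is running should give roughly what top shows under %st, averaged over one second.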
- dumbo, the Python lib for writing Hadoop streaming jobs, is both very nice and horribly under-documented
And when I say under-documented, I mean source code with not only absolutely no comments, but also a… distinctive Python style which is not altogether read-friendly.
And thus: when I was first learning how dumbo works, I was somewhat surprised to see that it offered the ability to write combine steps in python. Because all the Hadoop streaming docs said you couldn’t use combine with streaming.
How does dumbo do this? Well… by holding the entire output of the map stage in memory, sorting it itself, and only then passing lines back to Java.
Which, if, say, your data nodes are memory bound, has the potential to bring your entire system to its swapping knees. Which is what happened to me this afternoon.
So, um, beware of combine steps in dumbo, I guess.
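To make the memory problem concrete, here’s a minimal sketch of the two approaches. This is not dumbo’s actual code, just my illustration of the idea: `combine_all_in_memory` buffers and sorts everything the way I described above, while `combine_bounded` is a hypothetical alternative that flushes partial results once it is holding too many keys. Both names and the `max_keys` parameter are made up, and the bounded version assumes the combine function is associative and commutative (like `sum`) and that the reducer does the final merge.

```python
from itertools import groupby
from operator import itemgetter


def combine_all_in_memory(pairs, combine):
    """Buffer every (key, value) pair, sort, then combine per key.

    Memory use grows with the size of the entire map output.
    """
    buffered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(buffered, key=itemgetter(0)):
        yield key, combine(value for _, value in group)


def combine_bounded(pairs, combine, max_keys=10000):
    """Combine incrementally, flushing once max_keys distinct keys are held.

    A key may be emitted more than once; the reducer merges the partials.
    """
    partial = {}
    for key, value in pairs:
        if key in partial:
            partial[key] = combine([partial[key], value])
        else:
            partial[key] = value
        if len(partial) >= max_keys:
            for item in sorted(partial.items()):
                yield item
            partial.clear()
    # Emit whatever is left after the input is exhausted.
    for item in sorted(partial.items()):
        yield item
```

The bounded version combines less aggressively, but its memory use is capped by `max_keys` rather than by how much the mapper happens to emit.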
-Dan M

