Long ago (1988) I moved to Berkeley and started sending a monthly "newsletter" to my Boston friends. When I returned to Boston (1993), I continued the tradition for about five more years (or until I had kids). Looking back, I realize that I was actually blogging. Each newsletter contained anywhere from a few to several blog posts. Having been silent for the past decade or so, I've decided to resume these activities. Don't expect anything profound -- I tend to focus on what I find entertaining or amusing and perhaps sometimes informative. We shall see!

Friday, March 23, 2012

How I fell in love with the Turkers

I've been aware of Mechanical Turk for some time, probably since I volunteered to scan images in the amazing search for Jim Gray, but only recently did I have occasion to use it. I am working on a project that begins with the classification task, "Does this web page contain medical information?" Producing a classifier requires having a corpus that is already tagged. As I saw it, there were two options: I could bribe my students, friends, and co-workers with food, or I could try my hand at Mechanical Turk. Heck, what a great excuse to learn some new technology.

I decided that I'd drive my data pipeline in Python (it seemed another good thing for me to learn, and after all, sabbaticals really are about learning stuff). With my student, Elaine Angelino, as my Python tutor and some web surfing, I had a nice collection of tools (I'd like to thank Mitch Garnaat for boto and Jen Harvey for turkpipe).

For those of you unfamiliar with Mechanical Turk, there are two kinds of users: Requesters (those of us who have stuff we want done) and Workers (people who want to do stuff). I would primarily be a Requester and would be relying on Workers to classify my web pages. The units of work that Workers do are called HITS, Human Intelligence Tasks. Requesters indicate how much they are willing to pay for each HIT and what kinds of qualifications they want their Workers to have. Workers select HITS for which they are qualified.

My Mechanical Turk dabbling began with a few small data sets that I'd labeled myself. After having my IRB tell me that my use of the Turkers did not constitute research on people (which I knew, but I had to ask anyway), I nervously sent off my first job. I was amazed. At a whopping five cents per HIT, my 300 HITS were completed in about 10-15 minutes. And the accuracy was pretty good. I submitted each page three times and compared the best 2 of 3 classifications by the Turkers to my own hand labeling. Our agreement was roughly 80% and the points of disagreement were pretty consistent.
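For the curious, here is a minimal sketch of that best-2-of-3 comparison in Python. It is an illustration rather than my actual pipeline code: the page names, the labels, and the helper functions majority_label and agreement_rate are all made up for this example, and it assumes a yes/no style task so that three answers can never tie.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label among the worker answers for one page."""
    return Counter(labels).most_common(1)[0][0]


def agreement_rate(worker_labels, hand_labels):
    """Fraction of pages where the workers' majority label matches the hand label."""
    matches = sum(
        majority_label(worker_labels[page]) == hand_labels[page]
        for page in hand_labels
    )
    return matches / len(hand_labels)


# Hypothetical data: three worker answers per page, plus my own hand label.
worker_labels = {
    "page1": ["medical", "medical", "other"],
    "page2": ["other", "other", "other"],
    "page3": ["medical", "other", "other"],
    "page4": ["medical", "medical", "medical"],
    "page5": ["other", "medical", "medical"],
}
hand_labels = {
    "page1": "medical",
    "page2": "other",
    "page3": "medical",
    "page4": "medical",
    "page5": "medical",
}

print(agreement_rate(worker_labels, hand_labels))
```

With this toy data the Turkers and I disagree only on page3, so the agreement works out to four pages out of five.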

I submitted my second batch of HITS and not only did I get immediate turn around, but it turned out that one of my pages wasn't rendering correctly (it was clobbering Amazon's Mechanical Turk's header, so the workers could not accept the HIT, so they couldn't work on it). All three Turkers to whom it had been assigned sent me a note. Each note was polite and explained what was happening. Had they not told me, all I would have known is that some of my HITS hadn't been completed, and perhaps I would have been smart enough to log in as a worker and check them out (but perhaps not). I was really impressed -- these people who were doing some tasks for a nickel a shot all took the time to tell me there was a problem. I was truly grateful (and told them so). Some even replied to my thank you to let me know they'd be happy to test out other HITS.

I'm still tweaking my HITS a bit, but I am overwhelmingly happy with my Turkers. In less time than it took me to write this blog they already processed the HITS that didn't work before (there were about 15 of them). I may just have to figure out other research tasks for which Turkers can be helpful.

Saturday, March 17, 2012

Margo's Tips on Writing a Thesis

I advise students on writing theses. Sometimes these are my own students; sometimes they are students at Harvard working with other faculty; sometimes they are students who needed an external committee member. I figured that if I wrote down my philosophy about theses, then such students would know what they are getting into before they ask me. And maybe, others will find this useful as well.

I believe that it was my former student, Keith Smith, who first introduced me to the idea that every paper (and therefore every thesis) is a story. Before you write anything down, you'd better make sure you know what the story line is. The story line will help you pick your chapters, write your introduction, keep you focused, and potentially entertain you. Some students have a hard time thinking of their very serious research as a story. Somehow it feels demeaning to them. Well, I make it worse. I suggest they actually think about it in fairy tale terms -- that's right, start with, "Once upon a time," and end with "happily ever after." It may not be a morality tale, but ideally it will be a technical tale. So, the first thing I might ask a student inviting me to be on their committee is, "What story are you telling?" Be prepared to answer that question.

In addition to being a story, a thesis is a piece of writing. As a piece of writing, it should be technically correct. That means that you apply the basic rules of grammar, you run spell check, you strive to make the writing as elegant as the research. This seems obvious, but you'd be amazed how many students think of writing as some secondary process. The best theses are those that allow me to focus and think about the ideas (and story line), rather than the fact that every sentence makes me want to pull out my red pen.

There are things that tie together these first two items. Avoid being redundant and saying the same thing more than once (yes, that was intentional). I know that you are trying to simply staple N papers together and call it a thesis, but as a reader, I don't need to read about the same piece of related work five times. I also don't need to be told twelve times that your system is the most wonderful piece of technology ever created. Think story line. Most things will fit nicely in your story line once; figure out where that one place is.

Now let's get to the nitty gritty. Introductions are frequently the most difficult things for people to write. If you follow the advice here, you will wonder why you ever thought the introduction was difficult to write. The sole purpose of the introduction is to get your reader from a standing start to the part of your introduction that reads, "The contributions of this thesis are..." What is a standing start? I tell my students to write for "a smart computer scientist." This means you can assume that terms such as algorithm, main memory, processor, file system, tree, linked list, database, etc are fair game. However, you should not assume that terms such as inode, B+*link-Tree, the semantics of Haskell, direct storage, a pass-through FUSE file system, etc. are widely understood. It's often useful to pick out a specific individual to whom you are writing. If you are a theory person or a formal languages type, just pin a picture of me up on your computer; that will keep you honest. If you are one of my students, pin up a picture of any of my esteemed colleagues in theory: Salil Vadhan, Michael Rabin, Leslie Valiant, Harry Lewis (I leave out Michael Mitzenmacher, because he dabbles in so many different fields, he may very well be a systems person in disguise). These people are all wicked smart, but probably haven't spent the past decade deep in the bowels of the Linux vnode layer (and yes, vnode is another term I'd recommend avoiding in the introduction).

So, you have a very smart reader and you want to give them just enough background and motivation so that when you drop your contributions statement on them, they think, "Wow, that's cool." instead of, "Why should I care?" or "What on earth does that mean?" or, "Does that even matter?"

True confessions here: I never read the paragraph in most papers that says, "In Section 2, we motivate our study. In section 3, we provide background on existing approaches to our problem. Section 4 presents our approach in detail. Section 5 evaluates our approach and shows you how wonderful it is. Section 6 is the conclusion and it concludes our paper." However, in the thesis, you get to tell your story in short form. The chapter by chapter outline is actually the five-minute synopsis of your thesis. It tells the reader just how you are now going to walk them through your research so that they now understand not only what contributions you made, but how you made them, why you made them that way, and what interesting things you discovered along the way. You have the space -- each chapter gets its own paragraph. Do not cut and paste a paragraph from the introduction of the paper that you already published on the work in chapter N. Weave a description of the work presented in chapter N into your story line.

If you've done this well, you've done your reader an enormous service. If someone were holding your reader's child hostage, willing only to let the child go after the reader summarizes your thesis, the reader of a good introduction is in good shape. S/he can explain what you're doing, why you're doing it, what the big results are, and how you are going to convince the world of these things. There, now wasn't that easy?

So, let's wrap up the introductory chapter: It takes your reader from a standing start to a description of your contributions and then walks the reader through the chapter level outline of your thesis describing how it is that the chapters weave together to demonstrate the contributions that you are claiming.

Next, let's talk about related work. Contrary to popular belief, the purpose of the related work section is not to show that you've read a lot of papers. Instead, I like to think of the related work section as a work of art in which you construct a landscape that has a couple of blank white parts in it, and your research perfectly fills those blank spots. A good related work section walks the reader down a garden path such that at the end of it, the reader is left thinking, "Wow -- how could this problem have remained open so long?" or "It seems that this work is so obviously the right thing to be working on. Why am I not working on it?"

OK, so that's the goal, how do you accomplish it? In most theses, the related work section will break down into a few general areas. Figure out what those areas are. I tend to think both visually and hierarchically, which almost always results in my trying to cast my students' dissertations in terms of some multi-dimensional space. (Yeah, that's how my own thesis worked out, so perhaps it's the only thing I know?) If you can cast your work into some space that lets you place related work at particular points in the space and shows how there are regions in the space that are unexplored, and your work just happens to fall into those regions, then you're all set. If you can't do this, then you need to figure out how to place your work in context. What is the body of work out of which your work grew? What work inspired you? To which work should you be comparing your algorithm, approach, implementation? Once you've answered those questions you should know which work you want to discuss and ideally how to organize it. Then, when discussing the work, don't forget the related part. That is, rather than just say, "Peter, Paul, and Mary show that sorting is best done in linear time." explain how that fact relates to your own work. "While Peter, Paul, and Mary pioneered linear time sorting, we go one step beyond their work and show that in these special cases, sub-linear time is easy to obtain." You want to avoid a reader thinking, "Why did you tell me this?" Your prose needs to make it completely clear why the reader is wading through a discussion of work other than what is in your thesis.

You still may be struggling with related work, because you can't decide if it belongs early in your thesis or closer to the end. The reason you are having this struggle is that there are actually two kinds of related work. There is work that gives the reader sufficient background to understand what you're doing. If you're writing a file system thesis, then perhaps you need to explain to your reader what a vnode layer is or what an inode is. I like to think of this sort of related work as background. Then there is the second kind of related work, which I already discussed -- context setting. If you have a significant amount of background to discuss, then I'm a big fan of having a background chapter towards the beginning of the thesis and a related work section towards the end. The reason I like the context-placing related work at the end is that when you're making subtle (or even not-so-subtle) comparisons between your work and existing work, it's easier for the reader to understand it once s/he knows what you are actually doing.

It's a bit challenging to give detailed advice on the meaty chapters of the thesis, since those will vary tremendously from area to area, so I'll try to focus on a few of the things on which I always seem to comment and that seem to apply to a broad range of dissertations.

If your thesis includes algorithmic descriptions, then you're stuck deciding how to express those. Some people use pseudo code; others use real languages. The problem with pseudo code is that it's not precise; the problem with a real language is that the reader may be unfamiliar with the language's syntax. For example, I ended up reading a thesis that had many code samples in Haskell, a language that I really can't read. So, I asked the student to include a short tutorial. There is no single right answer, but you want your thesis to be approachable for as broad a range of readers as possible -- keep that in mind.

Similarly, if you need to present proofs, you need to use a syntax all your readers will understand. Don't assume that everyone reads every proof syntax the same way; define it. This is perhaps one way in which your thesis is quite different from a paper you submit to a conference that has a significant common vocabulary.

Another way your thesis differs from previous publications is that you have the luxury of space. This means that you should expound on some of the "why" questions that you may not have been able to do in shorter papers. Why is your architecture as it is? What else did you consider? Why did you choose what you did? (A lot of this needs to appear at least briefly in a good paper as well, but you get a lot more space in your thesis to explain these things.)

Similarly, there are no page limits, so you needn't cram all your figures into single column format. Make the diagrams, tables, and graphs nice and big so you can annotate them for easier comprehension, and so that your aging readers don't have to squint too much.

When you present performance results, make sure you explain why the results are what they are. In general, I like presenting performance results according to the following formula, "We ran this experiment and expected to see something, because of some reason. Figure X shows the actual results. As expected, on one part of the graph we observe the results we expected, but surprisingly we see that somewhere else the results are quite different. We ran additional experiments to help us understand these anomalous results and discovered something really interesting." Your most interesting results are often those where your experiment produced unexpected results.

Before wrapping up, let's talk about conclusions. Your final chapter needs to wrap up your thesis, come back to the original statement of contributions and now explain them in a bit more detail, since your reader now has both the context as well as an understanding of how you did something. This is where you can talk about the long-term implications of your results, which will lead gracefully into future work. You needn't only talk about work you could do, but how your thesis suggests work in other areas. I like to recommend that my students read thesis conclusions to look for good research projects. Make yours one I want my students to read.

I'll conclude by stating what should be obvious. Read your thesis through from beginning to end to make sure that you introduce items before you discuss them, that you don't say the same thing four times, and that the story line flows. Ideally, when you do that you'll be happy with the result. If you're not, perhaps you want to fix it before handing it over to those evaluating it!