Dolores Labs Research


Bing is an Improvement over Live, but Still Not Google Quality: Evaluating Bing With Mechanical Turk

June 10th, 2009 by Lukas Biewald

Microsoft’s new search engine, Bing, has recently gotten a lot of attention. Several people have already built tools to compare Google and Bing.

Since all the engines are fairly similar, it’s hard to separate true quality from our preconceptions. For example, one of Google’s internal tests is reported to have shown that “users still prefer the results with the Google logo, even if they’re not Google results.”

Are the new Bing search results really better than the old Live search results? Are they better than Google?

We took 100 random real-world queries and showed their results from each engine to workers on Mechanical Turk. For a single query, we showed the results from two engines side-by-side and asked workers to judge which result set was better. For each query, here’s the aggregate judgment from several workers:

Bing versus Google

Bing (Microsoft today) versus Live (Microsoft as of March)

Summary

We found that Google is statistically significantly preferred to Bing (p < 0.04), though the difference is rather small: Google is preferred on 55 percent of the queries, and on average it scores two tenths of a standard deviation better than Bing (0.141 on a four-point scale).
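
As a rough sanity check on how those figures fit together, here is a minimal sketch. It assumes a one-sample t-test on the 100 per-query score differences; that is an assumption for illustration, since the summary does not say which test was used, and the helper name `t_statistic` is hypothetical.

```python
import math

def t_statistic(mean_diff, sd, n):
    """One-sample t statistic for a mean difference (assumed test, not confirmed by the post)."""
    return mean_diff / (sd / math.sqrt(n))

n_queries = 100      # number of queries, from the post
mean_diff = 0.141    # Google minus Bing on the four-point scale, from the post
effect_size = 0.2    # "two tenths of a standard deviation", from the post

# Implied standard deviation of the per-query differences.
sd = mean_diff / effect_size
t = t_statistic(mean_diff, sd, n_queries)
print(f"implied sd = {sd:.3f}, t = {t:.2f}")
```

Under that assumption t comes out at about 2, which is in the range where a small effect measured over 100 queries just clears the usual significance thresholds.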

On the other hand, we found that users preferred Bing's new results to the older Live search results 55% of the time. But this result wasn't statistically significant -- they're virtually tied in aggregate.
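
To see why a 55/45 split over a sample this size can fail to reach significance, here is a small sketch of an exact two-sided sign test. It assumes 100 queries with no ties and 55 wins for Bing; the post gives only the percentage, and the function name `sign_test_p` is made up for this example.

```python
from math import comb

def sign_test_p(wins, n):
    """Exact two-sided binomial (sign) test against a 50/50 null."""
    k = max(wins, n - wins)
    # Probability of a split at least this lopsided in one direction.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

p = sign_test_p(55, 100)  # assumed: 55 wins out of 100 queries, no ties
print(f"two-sided p = {p:.3f}")
```

With these assumed counts the p-value is well above 0.05, so a win count alone cannot separate the two engines.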

In conclusion, Bing's quality seems to be improving, but it hasn't yet caught up with Google. Of course, relevance is just one component of a search engine's user experience. It's clear that all the major engines are quite close, and there exists a large set of queries where Bing significantly outperforms Google.

Details


Amazon Mechanical Turk/Crowdsourcing Work Meetup

May 22nd, 2009 by Lukas Biewald

June 10th, 6PM - we’re having a Mechanical Turk meetup at our office!

We have a great lineup of short talks:
Bob Carpenter (Alias I, Inc.) - Failure study of biology task
Rion Snow (Stanford) - The abundance of the stimulus: exploding data acquisition bottlenecks with the Turk firehose
Alexander Sorokin (UIUC) - Generic Web-Based Toolkit for Mechanical Turk
Mikhail Seregine (Jambool) - Architecture of AMT-based Shopping Engine
Lilly Irani (UC Irvine) - Turkopticon: Rating Requesters

We also have John Hoskins and Sharon Chiarella (Amazon VP of Mechanical Turk) coming down from Seattle to join us and answer questions.

I’m excited to find out what everyone is up to - and I’d especially like to invite our blog readers to come. If you would like to join us, please RSVP on our meetup page.

The Programming Language with the Happiest Users

May 12th, 2009 by Lukas Biewald

Which languages make programmers the happiest? It’s clear that some languages are more popular than others, and many of us debate long and hard over the relative merits of Python vs Ruby, C vs Java or Lisp vs everything else. But what’s the general consensus?

I decided to do a little market research. I scraped the 150 most recent tweets on Twitter for the query “X language” where X was one of {COBOL, Ruby, Fortran, Python, Visual Basic, Perl, Java, Haskell, Lisp, C}.

Then I asked three people on Amazon Mechanical Turk to verify that each tweet was on topic. If so, I asked if the tweet seemed positive, negative or neutral. You can try the task for yourself at http://crowdflower.com/judgments/mob/592.
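
One simple way to combine three judgments per tweet is a majority vote. The sketch below is an assumption about the aggregation, since the post does not describe it, and the label names and sample judgments are made up for illustration.

```python
from collections import Counter

SENTIMENTS = ("positive", "negative", "neutral")

def majority_label(judgments):
    """Return the label chosen by more than half of the workers, else None."""
    label, count = Counter(judgments).most_common(1)[0]
    return label if count > len(judgments) / 2 else None

def sentiment_share(tweets):
    """Share of each sentiment among tweets with an on-topic majority label."""
    labels = [majority_label(j) for j in tweets]
    on_topic = [label for label in labels if label in SENTIMENTS]
    if not on_topic:
        return {}
    counts = Counter(on_topic)
    return {s: counts[s] / len(on_topic) for s in SENTIMENTS}

# Hypothetical judgments: three workers per tweet.
perl_tweets = [
    ["positive", "positive", "neutral"],
    ["off_topic", "off_topic", "neutral"],   # majority says off topic, dropped
    ["negative", "positive", "neutral"],     # no majority, dropped
    ["negative", "negative", "negative"],
]
print(sentiment_share(perl_tweets))
```

Dropping tweets without a majority is one design choice among several; averaging the three judgments would keep more data at the cost of blurring disagreements.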

Whenever you judge sentiment, there are lots of tricky cases. The tweet “interesting idea and a new cool language, unlike old boring Lisp:)” seems negative towards Lisp, but the emoticon makes me think that the person may actually like Lisp. The tweet “Lisp … remains an influential language in ‘key algorithmic techniques such as recursion and condescension’” could be construed as positive towards Lisp, but could also be construed as negative.

On the other hand, many tweets are very clear, such as “Once again I find myself battling with Haskell. Why oh why create such a language?” or “The more I learn about Haskell, the more impressed I am with the language. Nothing like intentional infinite recursion, mind = blown.”

Without further ado, here are the results:

I am not surprised that COBOL was the least favorite, but I am somewhat surprised that Perl was the favorite. Sifting through the data, I found a surprising number of people who just felt like giving Perl shoutouts. More than other languages, it has tweets like “My favorite scripting language is Perl, which I can do in Linux, but right now, I need powershell.” or “I seriously love perl. Is there any better research language? truly?”

Clearly there are a million caveats. Are Perl or Lisp users happier people in general? Are the tweets ever from users? How do the people using Twitter differ from the population at large? And perhaps the particular day we scraped had an effect.


Notes:

  • There’s a great website langpop.com that does some nice analysis of relative language popularity.
  • The “C” query combines C++, Objective-C, C and C#. It would be nice to split this out in future work.

Updates:

Just to answer a couple of criticisms: C, C++, etc. were combined not because I don’t consider them to be completely different languages, but because it was difficult to search for just C or C++ or C#. I thought about taking it out completely, but figured why not show the data. I left out PHP because it was matching tons of webpages like example.com/home.php. I should have included JavaScript. For the record, the language I use the most these days is probably R, which I also forgot to include :).

I think most of the criticisms about the validity of the data are reasonable. What we try to do with these blog posts is not peer-reviewed science, but quick and dirty data exploration (see our blog’s manifesto: http://blog.doloreslabs.com/2008/03/the-manifesto/). I am not trying to imply that any language is better than any other language, just to get a rough measure of the sentiment out there on Twitter.

We’ve moved!

April 30th, 2009 by Stephanie Geerlings



Dolores Labs has moved. Whoosh! We are now located at 83A Wiese Street near 16th and Mission.

The best two things to do when you move are:

A) Something perverse, like drop a large fruit on your own floor.
B) Something collaborative, or at least play host to people who do good, symbiotic work.

We succeeded in both ventures, before the end of April, once again proving our agile abilities.

The dropping of a watermelon symbolically celebrates the market, masterfully set into motion by our own Chief Executive. (I made that up just now, pretty good, eh?)

More importantly, and for the betterment of your Web community, we hosted a JavaScript meetup. Thank you Matt for handling all of the logistics! We had a wonderful time. Hope to do it again sometime.

The floor is still a little sticky from the watermelon, and possibly from the JavaScript aficionados. ;)

Feel free to visit us. Our new office space is open to entrepreneurs and freelancers who need a space to focus, new and existing customers, and Altay Guvench. All others please call ahead on the phone line that has yet to be installed. We can’t be sure when the sky might be falling or for how long the caffeinated soda pops will be in large supply.

We’re looking to grow!

March 14th, 2009 by Chris Van Pelt


Why work at fancy VC-backed startups with on-site chefs, when you could work at Dolores Labs with our unlimited Otter Pop policy? Despite what this blog might lead you to believe, we have a business model, our revenues are growing and we have some money in the bank. We were recently featured in Forbes and CACM.

We’re looking for a developer, a sales/BD person and an intern or two. You can see that Chris wrote the Developer description and Lukas wrote the Sales/BD description — we won’t even try to merge the two into the same style :).

Developer

We’re looking for a versatile backend developer who loves Ruby, Postgres, and scaling. Here are the details:

Skills that are a must

Ruby: we’re a Ruby shop, you’d best know Ruby.
Unix: our servers are Ubuntu, our workstations are f-ing unibody MacBook Pros.
MVC Web frameworks: we use both Merb and Rails.
Deployment: dealing with Nginx, God or Monit, and Thin or Mongrel using Capistrano or telnet…
APIs: experience with all that RESTful hotness and ways to authenticate it.
Scaling: distributed queues, load balancing, redundancy, memcached, etc.
Relational DB: we love Postgres, we don’t hate MySQL, we always need to optimize.
Testing: you understand its importance but aren’t all military about it.
Git: that’s right, none of that non-distributed version control nonsense.

FTW

Statistics: you either know it or you love it, cuz we do :).
Editor: you can do some crazy unbelievable stuff in either VI or Emacs. I use TextMate :()
Front-end knowledge: you know where that front-end person is coming from.
Crowdsourcing: you’ve played with Mechanical Turk.
Taste: you like The Wire, you’re into Lost, you love Shawshank Redemption and Edward Norton.
Location: If you don’t live in SF, having lived on a street called Dolores at some point in your life helps. ;)
Humor: you have a sense of it.

Sales/Business Development

We’re also looking for someone to help with/lead our sales and business development. We have a solid product and pipeline full of deals that need to be closed. This is a chance to come in as the first non-engineer and take our business to the next level.

Responsibilities

  • Close deals with large companies
  • Own and manage the deal pipeline
  • Develop initiatives designed to grow the business in new areas, and identify areas to drive future improvement
  • Work with internal teams to ensure organizational understanding of partner product strategy, marketing initiatives and other partner needs
  • Research prospective business partners and assess competitive landscape of potential business development partnerships

Requirements

  • Proven ability to navigate and close deals with large companies
  • Background and success in building and tracking a robust sales pipeline
  • Proven track record in architecting complex deals
  • Exemplary analysis and problem-solving skills at both a strategic and a tactical level
  • Proven experience negotiating contracts and managing outside counsel

Experience with crowdsourcing or traditional outsourcing a plus.

Finally, we’re also looking for both engineering and non-engineering interns.

If this sounds like you, please send us a brief description of yourself along with a resume to jobs@doloreslabs.com.

Age and Gender Stereotypes

February 9th, 2009 by Lukas Biewald

A while back we built the website FaceStat, where you can upload a picture of yourself and find out what kind of first impression you would make on a stranger on the internet, and also judge others in kind.

To date, we’ve collected more than ten million judgments on over one hundred thousand faces. On a lazy Saturday afternoon, we finally dumped the data and played around with it.

Aggregating millions of these snap decisions tells us a lot about our own biases in surprising ways.

For example, you might think that 20-year-olds would be judged as most attractive. However, in this data babies are most attractive, with another peak around 26. After a dip from 40-50, attractiveness starts to increase again.

We have far more data on people between 18 and 40 on our website, which explains the tighter error bars.
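
The link between sample size and error bars can be sketched with the standard error of a mean, which shrinks with the square root of the number of faces in an age bucket. The spread and counts below are hypothetical, not FaceStat data.

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

sd = 1.0  # hypothetical spread of attractiveness ratings
for n in (100, 10_000):  # hypothetical number of faces in an age bucket
    print(n, standard_error(sd, n))
```

With the same spread, a bucket holding 100 times more faces gets error bars one tenth as wide.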

Women are judged as much more trustworthy than men, with the lowest scores for adolescent males. Interestingly, there is a large jump in trustworthiness for both men and women between 20 and 30, and between 50 and 60:


Crowdsourcing the origin of the word “Crowdsourcing”

February 9th, 2009 by Lukas Biewald

A few weeks ago, Steve Jurvetson sent me an email asking me if I knew the origin of the term “crowdsourcing”. Steve had been contacted indirectly by William Safire, author of the “On Language” column for the NY Times, who was looking for the first use of the term.

I had always thought that the term came from Jeff Howe’s June 2006 Wired article, but in fact Jeff Howe credits Steve with the term from an earlier post on Flickr.

A quick search on Google or the Internet Archive doesn’t turn up anything earlier than Steve’s post. But I thought it would be a perfect question to crowdsource. So I posted a task on Mechanical Turk to find the earliest use of the term.

Turkopticon

February 4th, 2009 by Lukas Biewald

One of the most important things we do at Dolores Labs is track the reputation of the workers on Amazon Mechanical Turk, so we know who is trustworthy and who is giving us questionable data.

But how do the Turkers (the people doing the work on AMT) know which requesters are trustworthy? In some ways it’s even more important for Turkers to know who is trustworthy, because requesters are allowed to refuse payment for a job with no recourse.

At Dolores Labs, we aspire to be as fair as possible, and only refuse payment when a Turker is giving us completely worthless results. But since our system is automated and does a large volume of tasks, we’ve surely made some mistakes.

Turkers can complain on message boards about bad requesters, but we’ve always felt it would be good for Turkers and the AMT marketplace to have better information about requesters’ reputations. So it was a pleasure to help my college friend Lilly Irani and her colleague Six Silberman build Turkopticon, a Firefox plugin that lets Turkers report bad requesters.

Anyone can download the Turkopticon plugin here, and they will be able to see all complaints that have been filed against a requester embedded inside the AMT interface. I would encourage all Turkers to download the plugin and help make Mechanical Turk a more transparent marketplace!

We helped Lilly and Six collect the seed data on which requesters were good and bad by creating a Turk task for them. We asked three questions, based on the complaints that we most often see: “How fair has this requester been in approving or rejecting your work?”, “How promptly has this requester approved your work and paid you?”, and “How well has this requester paid for the amount of time their HITs take?”.

It’s nice to see that the majority of requesters were reviewed positively:


Judging a stranger by their tweets continued…

January 23rd, 2009 by Lukas Biewald

In our last post we put up a visualization of how Turkers rated the top 200 twitterers, but some people asked for a table, so here it is:

If you have ideas for other interesting questions to ask, let us know.


Judging a stranger by their tweets

January 22nd, 2009 by Lukas Biewald

In previous posts, we looked at what happens when people judge strangers by their pictures. This time we looked at how people would judge strangers by their Twitter feeds.

We took the top 200 twitterers from http://twitterholic.com/ and asked people on Amazon Mechanical Turk to judge how smart, interesting, trustworthy, etc. they thought they were.

Here’s a scatterplot of the top twitterers showing the smart vs. interesting values. You can see someone like Matt Cutts was thought to be smart but not entertaining, while Jimmy Fallon was judged as entertaining but not as smart.

If you mouse over a feed you can see what Turkers said in the free response field.
