Wednesday, January 9, 2013

Mechanical Turk - Conclusions

In this simple proof of concept I asked ten people to process a receipt, giving information about the business, the date of purchase, and the category of purchase. Some of this could clearly be automated with OCR; however, the detail and accuracy required to make the results useful make this a good candidate for a Mechanical Turk task.

So how did people do?

Well, it took about 20 minutes for all 10 HITs to be processed. That's not too bad, and I'm sure response time is highly sensitive to price. I suspect 11 cents for each of these HITs is on the high side, since processing one probably takes 30 seconds or so at most. That said, this is an interesting design decision you have to make. It's pretty clear that response times are not going to be very predictable, and definitely not close to real time unless you pay a significant premium. If you have a really popular application and people get to know your tasks (something you can easily do with MTurk), then you might begin to achieve quicker and more consistent results, especially since the system allows you to tip particularly diligent workers.
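To put the "high side" hunch in numbers, here is a rough back-of-the-envelope sketch. It uses only the figures above (11 cents per HIT, 10 HITs, about 30 seconds each); the function names are my own, and the batch cost leaves out whatever fee MTurk adds on top of the reward.

```python
def implied_hourly_rate_cents(reward_cents: int, seconds_per_hit: float) -> float:
    """Hourly earnings (in cents) for a worker doing HITs back to back."""
    hits_per_hour = 3600 / seconds_per_hit
    return reward_cents * hits_per_hour


def batch_cost_cents(reward_cents: int, num_hits: int) -> int:
    """Total reward paid for a batch, excluding any platform fees."""
    return reward_cents * num_hits


# Figures from this post: 11 cents per HIT, 10 HITs, roughly 30 seconds each.
rate = implied_hourly_rate_cents(11, 30)
cost = batch_cost_cents(11, 10)
print(f"implied rate: ${rate / 100:.2f}/hour, batch cost: ${cost / 100:.2f}")
```

If the 30 second estimate is right, 11 cents works out to about $13.20 an hour for a worker doing these back to back, and the whole experiment cost $1.10 in rewards.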

As for accuracy, all ten people correctly identified the business and 9 out of 10 got the date right. The one person who didn't reversed the month and day, despite explicit instructions in the HIT to watch out for this. The context of the receipt should have made it pretty clear that it was from the US, but I suppose it's possible the worker thought this was a non-US receipt and read my instruction as meaning the month and day should be switched.
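A month/day swap is at least easy to detect mechanically. The sketch below is one possible check, not something I ran as part of this experiment: it flags dates that have a second valid reading when month and day are exchanged, and tests whether two answers differ only by that swap. The function names and the sample dates are made up for illustration.

```python
from datetime import date
from typing import Optional


def swapped_reading(d: date) -> Optional[date]:
    """Return the date obtained by exchanging month and day, or None when
    the swap is impossible (day > 12) or changes nothing (day == month)."""
    if d.day > 12 or d.day == d.month:
        return None
    return date(d.year, d.day, d.month)


def same_up_to_swap(a: date, b: date) -> bool:
    """True if two answers agree, or differ only by a month/day swap."""
    return a == b or swapped_reading(a) == b


# Hypothetical answers from two workers for the same receipt.
first = date(2013, 1, 5)
second = date(2013, 5, 1)
if first != second and same_up_to_swap(first, second):
    print("answers differ only by a month/day swap; send for review")
```

A receipt dated the 25th can never be misread this way, so only dates where the day is 12 or less would need the extra attention.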

As for categorization, it was all over the place. I would rate 3 out of 10 as correct answers (something along the lines of hardware or home improvement). Many people put food; although there was a food item, it was only a small part of the total, and the directions were to categorize the most significant part of the purchase. In a full application I would need to implement a drop-down menu or multiple-choice mechanism to get consistent categories, but the issue of 5 out of 10 categorizing the receipt as food is a fundamentally different problem. I will need to experiment with better directions to see if I can improve this accuracy.
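To see why a fixed category list alone would not fix this, here is a small sketch of the obvious aggregation step, a plurality vote over the workers' answers. The tally mirrors the counts above (3 hardware or home improvement, 5 food); the remaining 2 answers are lumped under a placeholder "other" label, since their actual values are not given here, and the category names themselves are hypothetical.

```python
from collections import Counter


def plurality(answers):
    """Return the most common answer and the share of workers who gave it."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Counts from this experiment; "other" is a placeholder label.
answers = ["home improvement"] * 3 + ["food"] * 5 + ["other"] * 2

winner, share = plurality(answers)
print(f"plurality answer: {winner} ({share:.0%} of workers)")
```

With these counts the vote settles on food, the wrong category, with half the workers behind it. Normalizing the labels makes the answers comparable, but only better directions can move the majority.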

MTurk provides a way to have a second person validate the result of the first person. This of course increases cost because it's a separate HIT, and it would lengthen the overall response time as well, but it appears to be a necessary step in quality control. MTurk also lets you "qualify" people, which might help in this case. For instance, I need people who understand English well enough to decipher the extreme abbreviations that appear on some receipts and who can use context, such as knowing that this was a hardware store receipt, to help with categorization.

One fascinating idea might be to preprocess receipts with OCR and use MTurk for confirmation and correction instead. Recall that the goal of this proof of concept is to develop a system that can accurately and automatically categorize receipts for a program like Quicken or Mint, making it easier for people to use that information for more detailed budgeting and money management.
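One way the OCR-first idea could be wired up is sketched below. Everything in it is an assumption: the `OcrField` structure, the per-field confidence score, the 0.9 threshold, and the sample values are all invented for illustration and do not come from any particular OCR library or from the receipt in this experiment. The point is only that fields the OCR is unsure about get routed to a HIT, while the rest skip the human step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrField:
    name: str          # e.g. "business", "date", "category"
    value: str         # text the OCR step extracted
    confidence: float  # hypothetical score between 0.0 and 1.0


def fields_needing_review(fields: List[OcrField], threshold: float = 0.9) -> List[str]:
    """Names of the fields whose OCR confidence falls below the threshold."""
    return [f.name for f in fields if f.confidence < threshold]


# Invented OCR output for a single receipt.
receipt = [
    OcrField("business", "ACME HARDWARE", 0.97),
    OcrField("date", "01/05/2013", 0.95),
    OcrField("category", "food", 0.40),
]

to_review = fields_needing_review(receipt)
print("send to MTurk for confirmation:", to_review)
```

A design like this would trade HIT cost against error rate through the threshold: a lower value sends fewer fields to workers but lets more OCR mistakes through.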

Overall, I think MTurk presents some unique capabilities that have not been widely exploited, especially in consumer applications. However, there are challenges, and the cost of MTurk, even at just a few cents per transaction, will be a barrier for many possible applications.
