nate koechley's blog

http://nate.koechley.com

There are a million APIs out there, and I couldn’t be happier. It’s easy now to translate street addresses to lat/long coordinates. It’s easy to grab local results and overlay them on a map. It’s easy to use Yahoo or Google to get all types of search results (local, images, etc.), and sites like Amazon to get prices and products.

But I think one of the coolest and most underrated APIs is the Term Extractor API from Yahoo!:

In other words, you point it at a piece of content — a news article, blog post, movie review or whatever — and it returns a list of terms, or keywords (or “tags” for those of you keeping score at home).

What do you do next with a list of keywords from a piece of content? Well, lots of things. Jeremy Keith wrote yesterday about a few ideas (that seem up for grabs, if you’re in a hacking mood!).

What if you treated each returned term as a tag? You could then pass those tags to any number of tag-based services, like Flickr, Del.icio.us, or Technorati.

So, instead of the simple “here’s my Technorati profile” or “here are my Flickr pics” on a blog, you could have links that were specific to each individual blog post. If I sent the text of this post to the term extractor, it would return a list of terms like “api”, “yahoo”, etc. By passing those terms as tags to a service like Technorati or Del.icio.us, readers could be pointed to other blog posts and articles that are (probably) related.
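As a rough illustration of the per-post links described above, here is a minimal Python sketch that turns a list of extracted terms into tag links. The `tag_links` helper is hypothetical, and the URL patterns for each service are assumptions made for illustration, so check them against each service before relying on them.

```python
from urllib.parse import quote

# Assumed tag-page URL patterns (illustrative only, not confirmed by the post).
TAG_URL_TEMPLATES = {
    "Technorati": "http://technorati.com/tag/{tag}",
    "Del.icio.us": "http://del.icio.us/tag/{tag}",
    "Flickr": "http://www.flickr.com/photos/tags/{tag}",
}


def tag_links(terms, templates=TAG_URL_TEMPLATES):
    """Map each extracted term to one link per tag-based service."""
    links = {}
    for term in terms:
        # Normalise the term and percent-encode it so it is safe in a URL path.
        tag = quote(term.strip().lower(), safe="")
        links[term] = {
            name: template.format(tag=tag)
            for name, template in templates.items()
        }
    return links


if __name__ == "__main__":
    # Terms like these would come back from the term extractor for this post.
    for term, services in tag_links(["api", "yahoo"]).items():
        for name, url in services.items():
            print(f"{term} on {name}: {url}")
```

Keeping the templates in a dictionary means another tag-based service can be added without touching the function.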

As he suggests, it gets interesting when you let the output from this web service be the input for another service. A few months ago I was lucky enough to lend a small bit of help to the team that brought you the Yahoo! Events Browser mashup. One challenge of that product was to get images associated with each event. If you’ve ever worked with unstructured data (and event listings are super unstructured), then you know that they don’t provide many high-quality hooks for understanding their content. The team tried doing image searches on venue or artist name, but the results weren’t very relevant or interesting, even when the parsed venue or artist was accurate. So, being the put-lots-of-pieces-together types they are, they decided to use the Term Extractor to discover more accurate, meaningful, and specific query terms to search for images with. Here’s how they summed it up:

To display appropriate images for events, local event output was sent into the Term Extraction API, then the term vector was given to the Image Search API. The results are often incredibly accurate.
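The chaining described in that summary might look something like the sketch below. Both `extract_terms` and `search_images` are hypothetical stand-ins passed in by the caller; they are not the real Yahoo! API signatures, and the `max_terms` cutoff is an assumption of mine rather than anything the team described.

```python
from typing import Callable, Iterable, List


def images_for_event(
    event_text: str,
    extract_terms: Callable[[str], Iterable[str]],
    search_images: Callable[[str], List[str]],
    max_terms: int = 3,
) -> List[str]:
    """Feed the output of a term extractor into an image search.

    extract_terms and search_images are placeholders for whatever
    term-extraction and image-search clients you actually use.
    """
    # Keep only non-empty terms, and cap how many go into the query.
    terms = [t.strip() for t in extract_terms(event_text) if t.strip()]
    terms = terms[:max_terms]
    if not terms:
        return []
    # Join the surviving terms into a single image-search query.
    return search_images(" ".join(terms))


if __name__ == "__main__":
    # Stub services so the sketch runs without any network access.
    def fake_extract(text: str) -> List[str]:
        return ["jazz quartet", "fillmore"]

    def fake_search(query: str) -> List[str]:
        return [f"http://example.com/images?q={query.replace(' ', '+')}"]

    print(images_for_event("Jazz quartet live at the Fillmore", fake_extract, fake_search))
```

Passing the two services in as functions keeps the pipeline testable with stubs, and makes it easy to swap either step for a different provider.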

I’ve only seen a handful of implementations of the Term Extractor API so far. If you’ve got a cool one to point me to, or a cool idea for a future implementation, please leave ’em in the comments below.

16 Responses to “Most Underrated API? The Yahoo! Term Extractor”

  1. I did some work with the Yahoo! term extractor to use it for tags, and it can be a bit “noisy”, to the point that some data checking had to be done in order to ensure that the quality of tagging was high.

    Tagyu seemed to give me better results; however, YMMV.

    My stuff/work/thoughts here.

  2. I’m looking into an events listing for a UK charity, but Yahoo! doesn’t currently stretch this far (I think). All this stuff seems US-centric at the moment. Is there anything UK-based yet?

  3. I used the term extractor on Wikipedia articles as a way of enriching the linking in some data I was processing: writeup

  4. Hey Simon,

    It’s true that many Y! products get launched in the States first. That’s partly because that’s where the bulk of the developers are, and partly because content acquisition (getting the data from a source) is often regional in scope, and the data is generally unstructured, which means a lot of post-processing has to be done.

    For events, I recommend you check out Upcoming.org (a recent Yahoo acquisition). It’s an open, so-called “Web 2.0” type site, and lets you list events from anywhere about anything. Once you add a few buddies to the service and join a few groups, it starts getting pretty awesome. Here’s their “metro” page for London:

    http://upcoming.org/metro/uk/london/london/

  5. We’ve gotten some cool uses of the term-extractor API for Post Remix (the Washington Post’s mashup site). For instance, Ripped from the Headlines and Amazon Light. I think NewsCloud may be using it, but I’m not sure.

  6. Yeah, NewsCloud does use the Yahoo term-extractor API to look for keywords it hasn’t seen before. After it knows about a keyword, however, it just looks for the keyword in the content itself, because Yahoo doesn’t give you a frequency count for each keyword.

  7. I use the term extractor API on a social predicting website and it works quite nicely.

    I pass the prediction text onto the T/E API and use the results from that to call the Yahoo Image and News APIs. It works nicely most of the time but is not without its quirks.

    Take a look: http://www.twocrowds.com

  8. [...] Today I came across a post about Yahoo! Term Extractor API by Nate Koechley. This can result into something that will not only benefit the readers but also the bloggers. In addition to ensuring that no terms are missed, it can fully automate discovery of related posts/articles on tag-based services like Technorati. And coming from Yahoo! it is very much usable in PHP, and so compatible with Wordpress!? [...]

  9. Ritwik Banerjee May 16th, 2006 - 4:28 am

    Nice post … especially close to me because I was working on a similar thing (just two of us) when the Yahoo! extractor came into being ……

  10. [...] yahoo api term extractor article Term extract documentation from Yahoo [...]

  11. Hi everybody!
    TermExtractor, my master’s thesis, is online at the
    address http://lcl2.di.uniroma1.it.

    TermExtractor is a software package for Terminology
    Extraction. The software helps a web community to
    extract and validate relevant domain terms in their
    interest domain, by submitting an archive of
    domain-related documents in any format.

    TermExtractor extracts terminology consensually
    referred to in a specific application domain. The
    software takes as input a corpus of domain documents,
    parses the documents, and extracts a list of
    “syntactically plausible” terms (e.g. compounds,
    adjective-nouns, etc.).
    Document parsing assigns greater importance
    to terms with text layouts (title, bold, italic,
    underlined, etc.). Two entropy-based measures, called
    Domain Relevance and Domain Consensus, are then used.
    Domain Consensus is used to select only the terms
    which are consensually referred to throughout the corpus
    documents. Domain Relevance is used to select only the
    terms which are relevant to the domain of interest; it
    is computed with reference to a set of
    contrastive terminologies from different domains.
    Finally, extracted terms are further filtered using
    Lexical Cohesion, which measures the degree of
    association of all the words in a terminological
    string. Accepted file formats are: txt, pdf, ps, dvi,
    tex, doc, rtf, ppt, xls, xml, html/htm, chm, wpd and
    also zip archives.

    I’d like you to participate in the TermExtractor
    evaluation task. The result of your evaluation will be
    put in a paper (I enclose a draft). Please contact me
    if you want to participate (this is very important for
    me!).

    MANY THANKS!!!


    Francesco Sclano
    home page: http://lcl2.di.uniroma1.it/~sclano
    msn: francesco_sclano@yahoo.it
    skype: francesco978

  12. [...] Today I came across a post about Yahoo! Term Extractor API by Nate Koechley. This can result into something that will not only benefit the readers but also the bloggers. In addition to ensuring that no terms are missed, it can fully automate discovery of related posts/articles on tag-based services like Technorati. And coming from Yahoo! it is very much usable in PHP, and so compatible with Wordpress!? [...]

  13. We used Yahoo term extractor for our World News website. It works like a charm.

  14. [...] data gets far more interesting when attached to people. Why settle for the results of Yahoo Term Extractor, when we can attach highly structured data from sources like Flixster, iLike and [...]

  15. [...] reading Nate Koechley’s article on Yahoo’s term extractor API I was inspired to connect it to actsastaggableonsteroids. [...]

  16. Hello Nate,

    Thanks for writing this article. I came across it through StumbleUpon. It inspired me to combine Yahoo’s term extractor with a Ruby on Rails tagging plugin. Not so underrated anymore ;)
