Using an LLM to extract machine readable data

This is a story about one of the few times in recent years that I’ve been truly amazed by technology. I’ve been noodling around with scraping recipes from the web for a side project, which has been surprisingly easy thanks to the widespread adoption of the JSON-LD format.

For instance, if we dive into the source for this egg fried rice recipe, we find a script tag with the type application/ld+json that includes structured JSON content.

{
    "@context": "http://schema.org",
    "@type": ["Recipe"],
    "name": "Perfect Egg Fried Rice (On Whatever Gear You Have) Recipe",
    "cookTime": "PT10M",
    "nutrition": { ... },
    "recipeCategory": ["Entree"],
    "recipeCuisine": ["Chinese"],
    "recipeIngredient": [
        "3 tablespoons (45ml) vegetable or other neutral oil",
        "3 large eggs, thoroughly beaten",
        "1 1/2 to 2 cups (about 1 pound/450g) cooked rice (see note)",
        "2 teaspoons (10ml) light soy sauce",
        "3 scallions, thinly sliced crosswise"
    ],
...
}

The JSON-LD markup follows the schema.org Recipe type, which gives us a clear picture of all of the fields that make up a recipe and allows us to parse the data using a number of libraries. Websites are incentivised to provide this data so that search engine scrapers can better understand their pages and rank them higher. This data is also used by Google to provide richer previews of the most relevant results like the one pictured below.
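As a rough illustration of how little code this takes, here is a minimal sketch that pulls recipe objects out of a page using only the Python standard library. The names `JsonLdParser`, `is_recipe` and `find_recipes` are made up for this example, and the handling of lists and `@graph` wrappers is an assumption about how some sites nest their data, not something taken from the page above.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        # Script contents can arrive in several pieces, so buffer them.
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than failing the scrape


def is_recipe(node):
    """True if the node's @type is "Recipe" or a list containing it."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return "Recipe" in types


def find_recipes(html):
    """Return every JSON-LD object in the page whose @type includes Recipe."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    recipes = []
    for doc in parser.documents:
        # Assumption: a block holds one object, a list, or an @graph wrapper.
        if isinstance(doc, list):
            nodes = doc
        elif isinstance(doc, dict):
            nodes = doc.get("@graph", [doc])
        else:
            nodes = []
        recipes.extend(n for n in nodes if isinstance(n, dict) and is_recipe(n))
    return recipes
```

Calling `find_recipes(page_source)` returns a list of plain dictionaries, so fields such as `recipeIngredient` can be read directly.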

Despite all of this recent innovation, there’s of course a long tail of ancient recipe blogs that haven’t adopted the standard yet. Extracting recipes out of these was proving fairly tricky and I had initially anticipated having to either develop good heuristics to find a recipe within a blob of HTML or dive further into ML land.

LLMs to the rescue

I was dreading the task until last night, when I had the idea to try to get a large language model (LLM) to generate JSON-LD structured data from the scraped source. Using one of OpenAI's GPT-3 models, I was able to get a machine to ~~extract~~ generate machine readable data.

All it took was this simple prompt:

The following is a web scrape of all text content from a recipe blog, extract the recipe in JSON-LD format. "<<content>>"

With it, GPT is able to generate amazingly accurate structured data for my favourite cheese scone recipe.
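The prompt is just string substitution, which can be sketched as below. The model call is deliberately left out: `complete` is a hypothetical stand-in for whatever function sends a prompt to the model and returns its text response, and `build_prompt` and `extract_recipe` are names invented for this example.

```python
# The prompt from the post, with the scraped text substituted for <<content>>.
PROMPT_TEMPLATE = (
    "The following is a web scrape of all text content from a recipe blog, "
    'extract the recipe in JSON-LD format. "{content}"'
)


def build_prompt(scraped_text):
    """Wrap the scraped page text in the extraction prompt."""
    return PROMPT_TEMPLATE.format(content=scraped_text)


def extract_recipe(scraped_text, complete):
    """Ask the model for JSON-LD.

    `complete` is an assumption: any callable that takes a prompt string
    and returns the model's text response.
    """
    return complete(build_prompt(scraped_text))
```

Passing the completion function in as an argument keeps the sketch independent of any particular client library and makes it easy to swap in a fake for testing.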

{
  "@context": "http://schema.org/",
  "@type": "Recipe",
  "name": "Cheese Scones",
  "cookTime": "PT15M",
  "prepTime": "PT10M",
  "totalTime": "PT25M",
  "image": "https://lifeandcheesedotcom.files.wordpress.com/2012/06/scone1.jpg",
  "description": "This is the best and easiest cheese scone recipe I know. It comes from a cafe in Wellington – Ministry of Food. It's close to our Parliament buildings and I got addicted to these scones when I was working near by.",
  "recipeIngredient": [
    "2 cups flour",
    "2 teaspoons baking powder",
    "1 teaspoon sugar",
    "1/2 teaspoon salt",
    "pinch cayenne pepper (optional)",
    "2 cups grated tasty cheese",
    "1 cup full-cream milk"
  ],
  "recipeInstructions": [
    "Preheat oven to 220º C. Put the tray in to heat.",
    "Sift flour, baking powder, salt, sugar and cayenne pepper into a bowl. Mix in cheese. Make a well in the centre.",
    "Check you have a clear bench with flour on it, the oven is fully heated. You want the scones to go into the oven as soon as possible.",
    "Add the milk, mix quickly with a knife. Turn out onto a floured bench and quickly shape into a 3 cm thick rectangle. Cut into 8 and put on to your floured, hot tray. Bake for 15-20 minutes. You'll smell them as they get ready.",
    "Delicious! If you don't eat them all they freeze well. To reheat put them into an oven for about 5 minutes."
  ]
}

Just like that. Valid JSON-LD structured data. The model even provided us with a relevant image (which is real and works) and was able to ascertain reasonable prep and cook times from the instructions. Trying a few more recipes produced similarly amazing results, even when the input was an ancient recipe blog with unclear formatting.
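Since the output is generated rather than extracted, it seems worth sanity checking before trusting it. The following is a minimal sketch of such a check, not a full JSON-LD validator: the choice of required fields is an assumption, `check_recipe` is a hypothetical name, and the duration pattern only covers the time-only ISO 8601 form seen in the examples above (such as PT15M).

```python
import json
import re

# Matches time-only ISO 8601 durations such as PT15M or PT1H30M.
# Assumption: recipes rarely need the day/month/year forms.
ISO_DURATION = re.compile(r"^PT(?=\d)(?:\d+H)?(?:\d+M)?(?:\d+S)?$")

# Assumption: the fields a recipe is not much use without.
REQUIRED_FIELDS = ("name", "recipeIngredient", "recipeInstructions")
DURATION_FIELDS = ("cookTime", "prepTime", "totalTime")


def check_recipe(raw):
    """Parse model output and return (recipe, problems).

    `recipe` is None when the output is not a JSON object at all;
    `problems` is a list of human-readable strings, empty when all is well.
    """
    try:
        recipe = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc}"]
    if not isinstance(recipe, dict):
        return None, ["top level is not a JSON object"]

    problems = []
    types = recipe.get("@type", [])
    if isinstance(types, str):
        types = [types]
    if "Recipe" not in types:
        problems.append("@type does not include Recipe")

    for field in REQUIRED_FIELDS:
        if not recipe.get(field):
            problems.append(f"missing {field}")

    for field in DURATION_FIELDS:
        value = recipe.get(field)
        if value is not None and not (
            isinstance(value, str) and ISO_DURATION.match(value)
        ):
            problems.append(f"{field} is not an ISO 8601 duration")

    return recipe, problems
```

This only confirms the response is well formed. It cannot tell whether an ingredient or cook time was invented by the model, which would need a comparison against the scraped text.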

Problems and next steps

As amazing as this is, the limiting factor here is the token limit imposed by current models. The models currently available to us have a limit of 4097 tokens, which must be shared between the prompt and the response.

We can run into this limitation quite easily since we’re including raw HTML as part of the prompt. Scrapes from recipe blogs can also be very large as they often include a tonne of advertisements, comments and links to generate traffic.
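To get a feel for how quickly a scrape eats the budget, here is a back-of-the-envelope sketch. Only the 4097 limit comes from above. The four-characters-per-token ratio is a rough rule of thumb for English text and the 1000 tokens reserved for the response is an arbitrary assumption; a real tokenizer library would give exact counts.

```python
CONTEXT_LIMIT = 4097      # shared between prompt and response
RESPONSE_RESERVE = 1000   # assumption: room left for the generated JSON-LD
CHARS_PER_TOKEN = 4       # assumption: rough average for English text


def estimate_tokens(text):
    """Very rough token estimate, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(prompt):
    """True if the prompt plus the reserved response is within the limit."""
    return estimate_tokens(prompt) + RESPONSE_RESERVE <= CONTEXT_LIMIT
```

Under these assumptions the prompt can be roughly 12,000 characters long, which a page full of comments and adverts will exceed easily.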

While the latest model (GPT-4) would extend this limit to 32,768 tokens, another approach for now might be to go back to heuristics and extract the main content from the scrape before including it in the prompt. This would also reduce the cost, since most of the models bill at a fixed rate per 1000 tokens.
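One very crude heuristic along these lines is to drop everything inside tags that rarely hold recipe content and keep only the remaining text. The sketch below is an illustration under that assumption rather than a tested content detector: the `SKIP_TAGS` list is a guess, `TextExtractor`, `extract_text` and `estimate_cost` are hypothetical names, and the price is a parameter because actual rates vary by model.

```python
from html.parser import HTMLParser

# Assumption: tags whose contents are unlikely to be part of the recipe.
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside", "form"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self._parts.append(text)

    def text(self):
        return "\n".join(self._parts)


def extract_text(html):
    """Return the page text with boilerplate sections removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def estimate_cost(token_count, price_per_1000_tokens):
    """Cost of a request at a fixed rate per 1000 tokens."""
    return token_count / 1000 * price_per_1000_tokens
```

A known weakness is badly nested HTML: an unclosed `<nav>` would cause everything after it to be skipped, so a real implementation would need to be more forgiving.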

Stay tuned for Part 2 where I experiment with content detection 🚀