ChefAI

Gagan Kaushik
14 min read · May 8, 2021

Co-Authors: Gagan Kaushik, Samuel Yeboah, Will Worthington, Samuel Ochoa, Clark Poon, Vignesh Krishnamurthy, Deepanshi Sharma

Github: https://github.com/willworthington/ChefAI

INTRODUCTION

In this article, we present our approach to generating novel cooking recipes with unsupervised learning. Our work focuses on using GANs and NLP to generate text-based ingredient lists and recipes. We created the ingredient lists by training a tabular generative adversarial network to produce viable combinations of ingredients. The training data was gathered from a Kaggle dataset that contains over 7,000 different ingredients and over 230,000 recipes. An NLP model called GPT-2 then generates the recipe steps from the ingredients provided by the tabular GAN. We evaluated the model using qualitative and other task-based metrics. Our goal was to produce recipes that are delicious and edible.

MOTIVATION

In college, deciding what to eat is always a stressful task given the various dynamics at play. On one hand, cooking food is cheaper but extremely time-consuming, while eating out is expensive but convenient. If you decide to cook food, you then deal with other issues such as: Do I have the right ingredients? Do I have a recipe? What recipe do I want to cook? We created a project that would ease the stresses of cooking by generating recipes on demand based on a list of predetermined ingredients.

BACKGROUND

The goal of this project was to use machine learning models to generate novel cooking recipes. For this project, we defined a recipe as a list of ingredients followed by a set of steps to prepare and cook the ingredients. Thus, we broke this definition into two subproblems: generating a list of ingredients and generating steps to prepare the ingredients. Ingredients can be represented as tabular data and therefore can be generated using a tabular generative adversarial network (GAN for short). The steps, on the other hand, must be generated as novel English sentences, so generating the steps is a natural language processing (NLP) problem. To generate the ingredients, we trained a tabular GAN model called CTGAN. To generate the steps, we trained an NLP model called GPT-2, which takes a newly generated list of ingredients as input. Finally, we synthesized a realistic image of the recipe from the generated list of ingredients and steps using an attention-based generative neural network called CookGAN.

CTGAN

CTGAN is a conditional generator built on a generative adversarial network. It was designed to learn complex, non-Gaussian probability distributions from tabular data and generate new synthetic samples, a task that other statistical and neural network-based methods tend to struggle with. It supports tables containing both continuous and discrete columns; each column is treated as a random variable, and together the columns form an unknown joint distribution that the GAN learns. Since generating a new recipe starts with generating an ingredient list, CTGAN was a good choice for us: our dataset of ingredients (and steps) was in tabular format, and CTGAN works with one-hot encodings and naturally allows conditional generation. If someone has a few ingredients that are about to go bad, they can generate a recipe whose ingredient list is guaranteed to include the ones they already have, by conditioning the generated sample on the presence of those ingredients. The authors of the CTGAN paper found that it learns better distributions than Bayesian networks across several benchmarks.
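To make the conditioning idea concrete, here is a minimal sketch assuming the open-source ctgan package; the toy table and its ingredient columns are hypothetical, and a real run would use far more data and epochs.

```python
# Minimal sketch of CTGAN conditional sampling, assuming the open-source
# `ctgan` package. The toy table and ingredient column names are hypothetical.
import pandas as pd
from ctgan import CTGAN

# Toy one-hot table: each row is a recipe, each column an ingredient.
data = pd.DataFrame({
    "butter": [True, False, True, False] * 50,
    "garlic": [False, True, True, False] * 50,
})

model = CTGAN(epochs=10)  # far fewer epochs than a real training run
model.fit(data, discrete_columns=["butter", "garlic"])

# Generated rows conditioned on an ingredient the user already has:
with_butter = model.sample(5, condition_column="butter", condition_value=True)
```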

GPT

GPT-2 is a natural language processing model designed by researchers at OpenAI. OpenAI is a non-profit organization whose mission is to ensure that artificial intelligence benefits all of humanity. They developed GPT-2 as a “large transformer-based language model with 1.5 billion parameters.” It was trained on 40GB of Internet text to predict the next word given all previous words in some text (Radford). This functionality can then be applied iteratively to come up with the next word after that, and so on. GPT-2 is very versatile, as developers can leverage transfer learning to train the model to perform specific tasks such as generating a continuation of a given text, question answering, reading comprehension, summarization, and translation (Radford). For the purposes of this project, we trained it on over 230,000 recipes, each given as an ingredient list followed by a list of steps. Our particular use was essentially a text continuation problem: to generate new steps, we would provide a list of ingredients as a prefix to the trained model, which would then continue the text, word by word, producing a list of steps to cook the ingredients. The specific implementation details are discussed later on, but this high-level functionality of GPT-2 successfully helped generate novel recipes.
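To illustrate the iterative next-word idea (this is not our training setup, just a conceptual sketch using the Hugging Face transformers implementation of GPT-2):

```python
# Conceptual sketch of GPT-2's iterative next-token prediction, using the
# Hugging Face `transformers` library. Greedy decoding for simplicity.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tok.encode("Preheat the oven to", return_tensors="pt")
for _ in range(10):
    logits = model(ids).logits[:, -1, :]            # scores for the next token
    next_id = logits.argmax(dim=-1, keepdim=True)   # pick the most likely one
    ids = torch.cat([ids, next_id], dim=1)          # append and repeat
print(tok.decode(ids[0]))
```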

Food Image Generation

CookGAN (Cook Generative Adversarial Network) is a model that generates a photorealistic meal image conditioned on a list of ingredients and steps. Prior work on image generation from text has assumed that the visual categories are well-structured, singular objects (such as faces or automobiles). Meal images, however, vary far more in appearance across food types and ingredients. To account for this variability, the framework relies on an attention-based recipe association model to extract ingredient features and determine their relative contributions. Trained on a dataset of ~1 million recipes with titles, instructions, ingredients, and images, CookGAN achieves photorealistic meal images for recipes with fewer than 20 ingredients and instructions.

DATA

Recipe dataset from Kaggle

We used a dataset of around 180k recipes that we found on kaggle.com. We chose to use this dataset because it contains many recipes and the providers of the dataset (Shuyang Li and collaborators) already did some preprocessing to determine which ingredients were used in each recipe. This made it relatively straightforward for us to feed the recipes and their ingredient lists into our model. The dataset can be viewed at https://www.kaggle.com/shuyangli94/food-com-recipes-and-user-interactions.

Data preparation for CTGAN

The Kaggle dataset contains extraneous information that we did not need (calorie level, recipe name, recipe index, and recipe id), so we started by removing all of this from the dataset. Next, we needed the model to treat the ingredients as categorical rather than numerical data, since the ingredient ids have no meaningful order. We used one-hot encoding for this because it is a straightforward way to turn each ingredient id into its own binary column.
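A minimal sketch of this encoding step, assuming pandas and scikit-learn; the file and column names follow the Kaggle dataset's layout but should be treated as assumptions:

```python
# Sketch: one-hot encode each recipe's ingredient-id list.
# "PP_recipes.csv" and "ingredient_ids" are assumptions based on the
# Kaggle dataset's layout.
import ast
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

recipes = pd.read_csv("PP_recipes.csv")
ingredient_lists = recipes["ingredient_ids"].apply(ast.literal_eval)

mlb = MultiLabelBinarizer()
onehot = pd.DataFrame(
    mlb.fit_transform(ingredient_lists),
    columns=[str(c) for c in mlb.classes_],  # ingredient ids as column names
)
```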

When we attempted to train CTGAN on this data, we ran into two issues: CTGAN had trouble converging, and it used up all of our RAM, crashing several computers that we tried to train on. We solved both problems by removing ingredients that appeared infrequently in the dataset (e.g., in fewer than 100 recipes) and then removing every recipe that contained one of the removed ingredients. This reduced the dataset from ~178k recipes with ~8k ingredients to ~48k recipes with 579 ingredients. It solved the first problem by reducing the sparsity of the one-hot encoded ingredient matrix, which allowed CTGAN to converge more easily, and it reduced the amount of RAM required to train CTGAN to a manageable level because the dataset was smaller.
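A sketch of that filtering step, continuing from the one-hot table built above (the 100-recipe cutoff is the one mentioned):

```python
# Sketch: drop ingredients that appear in fewer than 100 recipes, then drop
# every recipe that used one of the removed ingredients.
counts = onehot.sum(axis=0)
rare = counts[counts < 100].index

keep = onehot[rare].sum(axis=1) == 0      # recipes with no rare ingredient
onehot = onehot.loc[keep].drop(columns=rare)
```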

Possible improvements for data preparation

We also attempted some additional data preparation but were unable to incorporate it into our model due to constraints on time and computational resources. We looked for a way to reduce the number of ingredients without shrinking the dataset as much, by finding similar ingredients and combining them into a single ingredient (e.g., combining “macaroni noodle” and “small macaroni noodle”). To do this, we built a graph in which each node represents an ingredient and an edge of weight w between two ingredients indicates that they share w common words. We then found clusters of ingredients by taking the connected components of the graph. If a cluster was too large, we removed all edges in the cluster with weight below a certain threshold and repeated the algorithm until no cluster had too many ingredients (e.g., more than 40); a code sketch of this procedure follows the example below. This approach successfully generated clusters of ingredients that seemed related. For example, one of the largest clusters created was:

[‘dried mild red chili pepper’, ‘green chili pepper’, ‘dried red pepper flake’, ‘dried red pepper’, ‘ground red chili pepper’, ‘dried chili pepper flake’, ‘dried chili pepper’, ‘mild chili pepper’, ‘dried red chili pepper’, ‘crushed red pepper flake’, ‘ground red pepper’, ‘dry red pepper’, ‘red chili pepper flake’, ‘green bell pepper’, ‘red chili pepper’, ‘ground ancho chili pepper’, ‘thai red chili pepper’, ‘dried red chili’, ‘green bell pepper flake’, ‘green pepper flake’, ‘green chili pepper flake’, ‘red pepper flake’, ‘chili pepper flake’, ‘dry crushed red pepper’]

Unfortunately, we were unable to integrate this process into our pipeline due to time constraints. Because merging ingredients would also increase the size of the training dataset (by causing us to remove fewer recipes), we were worried about running into RAM constraints, so we did not prioritize getting it integrated into our pipeline.
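The sketch below shows the idea, using networkx; note that it prunes low-weight edges globally on each pass, a simplified variant of the per-cluster pruning described above, under the same word-overlap weighting:

```python
# Sketch of the ingredient-merging graph: nodes are ingredient names, an edge
# of weight w means the two names share w words. Clusters are connected
# components, re-split by raising the weight threshold until no component
# exceeds max_size. (Pruning is global here; we pruned per oversized cluster.)
import itertools
import networkx as nx

def cluster_ingredients(ingredients, max_size=40):
    words = {ing: set(ing.split()) for ing in ingredients}
    G = nx.Graph()
    G.add_nodes_from(ingredients)
    for a, b in itertools.combinations(ingredients, 2):
        shared = len(words[a] & words[b])
        if shared:
            G.add_edge(a, b, weight=shared)

    threshold = 1
    while True:
        clusters = list(nx.connected_components(G))
        if all(len(c) <= max_size for c in clusters):
            return clusters
        threshold += 1
        G.remove_edges_from([(u, v) for u, v, d in G.edges(data=True)
                             if d["weight"] < threshold])
```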

Format CTGAN output to feed into GPT

In order for GPT to make use of the synthetic recipes generated by CTGAN, each synthetic recipe has to be converted from True/False values (indicating whether each ingredient appears in the recipe) to the actual names of the ingredients, based on their respective ids. To do this, we utilized an ingredient map included as part of the Kaggle dataset to find the mapping from each ingredient id to its name. To reduce the complexity of our model, we consolidated many similar ingredients, such as “romaine lettuce leaf” and “iceberg lettuce leaf,” into a single ingredient and id number: in this example, “lettuce” with id 4308. By iterating over the CTGAN output recipe by recipe, converting each True value to its corresponding ingredient, and appending it to that recipe’s ingredient array, we obtained a final ingredient list for each recipe that could then be fed into GPT-2 to generate steps for cooking meals.
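A sketch of that conversion; the ingredient map ships with the Kaggle dataset, but the column names used here (“id”, “replaced”) are assumptions:

```python
# Sketch: turn CTGAN's True/False rows back into ingredient-name lists.
# The ingr_map column names ("id", "replaced") are assumptions based on the
# Kaggle dataset's ingredient map.
import pandas as pd

ingr_map = pd.read_pickle("ingr_map.pkl")
id_to_name = dict(zip(ingr_map["id"], ingr_map["replaced"]))

def decode(synthetic: pd.DataFrame) -> list:
    """One ingredient-name list per synthetic recipe row."""
    return [
        [id_to_name[int(col)] for col in synthetic.columns if row[col]]
        for _, row in synthetic.iterrows()
    ]
```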

PROCESSING

Our goal was to create a machine learning pipeline that can generate a new recipe, including the ingredients, the steps, and an image of the final product. To accomplish this, we use CTGAN to generate a list of ingredients, GPT-2 to come up with the steps, and CookGAN to generate an image.

CTGAN

Generating the list of ingredients is a tabular generation problem that can be solved using CTGAN. Given a table of ingredients in one-hot encoded format, CTGAN can produce new synthetic rows of the data, which can then be converted into new ingredient lists that should be a reasonable start to a recipe. We attempted training CTGAN with datasets ranging from 171k recipes down to 30k recipes; it took a lot of experimentation to figure out how large a dataset we could train on in a reasonable amount of time. We found that CTGAN could train on a 40k-recipe subset for 300 epochs in about 8 hours, so that is what we used to create our final model. Once the model was trained, we could generate synthetic ingredient lists for GPT-2 to use.
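A sketch of that final training configuration, again assuming the ctgan package; `onehot` is the filtered one-hot table built in the Data section:

```python
# Sketch of the final CTGAN training run: a 40k-recipe subset for 300 epochs
# (roughly 8 hours in our experience). Assumes the `ctgan` package and the
# filtered one-hot table from earlier.
from ctgan import CTGAN

train = onehot.sample(n=40_000, random_state=0)
model = CTGAN(epochs=300, verbose=True)
model.fit(train, discrete_columns=list(train.columns))  # every column is binary

synthetic = model.sample(1000)  # new synthetic ingredient rows for GPT-2
```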

GPT

We trained our GPT-2 model to take ingredients after the signifier “[INGREDIENTS]” and to return the name after the “[TITLE]” tag, followed by the steps after the “[STEPS]” tag. An illustrative training string in this format (shortened here) looks like the following:
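[INGREDIENTS]: chicken breast, garlic clove, butter, salt, pepper

[TITLE]: garlic butter chicken

[STEPS]:

1. Melt the butter in a skillet over medium heat

2. Add the garlic and cook until fragrant

3. Season the chicken with salt and pepper and cook until no longer pink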

The model was trained on hundreds of thousands of text strings like this one by fine-tuning the pre-trained GPT-2 “355M” model. This is a medium-sized model that was sufficiently trained for our purposes. By using a pre-trained model, we were able to leverage the fact that the model could already write cohesive English sentences; all we needed to do was fine-tune it to perform a specific task. In our case, the model’s task was to take in a list of ingredients and generate a recipe name and the steps to cook it. Below, we sketch how we generated a name and steps given a hard-coded list of ingredients as a baseline.
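This is a minimal sketch rather than our exact training script, assuming the gpt-2-simple library, whose generate() function exposes the length, temperature, prefix, and nsamples parameters discussed next; the file name and hyperparameters are illustrative.

```python
# Sketch of fine-tuning and generation, assuming the `gpt-2-simple` library;
# the dataset file name and hyperparameters are illustrative.
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="355M")     # pre-trained medium model

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="recipes.txt",      # [INGREDIENTS]/[TITLE]/[STEPS] strings
              model_name="355M",
              steps=1000)                 # illustrative step count

samples = gpt2.generate(
    sess,
    prefix="[INGREDIENTS]: steak, mushrooms, green onions, potatoes, spinach",
    length=200,         # tokens to generate
    temperature=0.7,    # lower = more predictable text
    nsamples=10,        # several candidates per ingredient list
    batch_size=10,      # must evenly divide nsamples
    return_as_list=True,
)
```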

As shown, the GPT-2 model takes several input parameters to generate text; we will only discuss the ones most significant to text generation. The length parameter sets the number of tokens (roughly, words) for the model to generate. Temperature is an interesting parameter that takes a value in the range zero to one and dictates the predictability of the next word generated. As temperature lowers, the model generates fewer interesting words, and at zero it often generates no words at all; on the flip side, when temperature is maxed out at 1.0, the model can come up with some crazy (and often hilarious) steps. The prefix parameter is the text that the model continues from: in our case, we pass in the ingredient list, and the model then generates a title and list of steps. Finally, nsamples is the number of samples the model should generate. We set it to ten because natural language processing is a complex task, and generating cooking steps makes it even more complicated; the generated text often doesn’t quite make sense, so we increase the odds of getting a strong result by generating several samples for the same input.

Food GAN

Finally, we applied generative deep models to synthesize a photorealistic image of the recipe based on the list of ingredients and steps generated by GPT-2. This builds on an attention-based ingredients-image association model, which is then used to condition a meal image synthesis GAN. To generate a meal image from a list of ingredients and steps, we first trained the attention-based association model to find a shared latent space between ingredient/step lists and images, then used this latent representation to train a GAN that generates a meal image conditioned on the list. Ingredient and step features were extracted using cross-modal analysis to match an ingredient list with its corresponding image in the joint latent space. Ingredients and steps are one-hot encoded and processed through an attention mechanism that models the contribution of each one. Two neural networks, one for the lists and one for the images, embed the recipe list, its corresponding image, and an image from another recipe into the joint latent space. In this way, we are able to generate meal images from a list of ingredients and steps, which has interesting implications for nutritional health: future work could involve extracting ingredient or calorie information from a meal image to monitor daily nutrition and manage diet.

RESULTS

Overall, our results look pretty good but have lots of variability. CTGAN successfully generates novel ingredient lists for each sample, showing that we have avoided mode collapse. However, the results are not always palatable to human tastes. Here are a few of our top ingredient lists:

[‘brown sugar’, ‘butter’, ‘sour cream’, ‘paprika’, ‘mushroom’, ‘italian seasoning’, ‘lean ground beef’, ‘allspice’, ‘french bread’]

[‘brown sugar’, ‘nutmeg’, ‘worcestershire sauce’, ‘sour cream’, ‘chicken breast’, ‘salt & pepper’, ‘vegetable stock’, ‘cayenne’]

[‘salt’, ‘pecan’, ‘butter’, ‘egg’, ‘sugar’, ‘garlic clove’, ‘onion’, ‘olive oil’, ‘garlic’, ‘bacon’]

[‘water’, ‘salt’, ‘nutmeg’, ‘black pepper’, ‘sugar’, ‘pepper’, ‘vanilla extract’, ‘garlic clove’, ‘milk’, ‘walnut’, ‘green bean’, ‘chicken thigh’]

[‘salt’, ‘potato’, ‘sour cream’, ‘ground beef’, ‘paprika’, ‘mayonnaise’, ‘mushroom’, ‘ground cinnamon’]

[‘salt’, ‘butter’, ‘salsa’, ‘pepper’, ‘onion’, ‘chicken breast half’]

[‘butter’, ‘egg’, ‘sesame oil’, ‘bell pepper’, ‘bread flmy’, ‘fresh parsley’, ‘green bean’, ‘chicken breast half’, ‘lemon’]

[‘salt’, ‘flmy’, ‘butter’, ‘egg’, ‘potato’, ‘baking powder’, ‘olive oil’, ‘tomato’, ‘milk’, ‘paprika’]

[‘salt’, ‘chicken broth’, ‘soy sauce’, ‘worcestershire sauce’, ‘sausage’, ‘tomato’, ‘parmesan cheese’, ‘mayonnaise’, ‘beer’, ‘tabasco sauce’, ‘dijon mustard’, ‘ground cinnamon’, ‘kosher salt’]

CTGAN does well at synthesizing flavor profiles for ingredient lists. For example, it seems to recognize that garlic belongs in a lot of savory flavor profiles and that butter belongs in a lot of baked goods and sweet dishes. However, it doesn’t seem to have learned that flavor profiles are generally mutually exclusive. Looking at the third list above, it identifies “pecan, butter, egg, sugar,” which could be great ingredients for some kind of dessert, but then also includes “salt, garlic clove, onion, olive oil, garlic, bacon.” This suggests that CTGAN has learned certain flavor profile patterns but has not yet learned all the rules of how flavor profiles interact within a recipe. With more time and resources, we would explore this further by training for many more epochs, and perhaps on a larger dataset.

GPT-2 did a remarkably good job of producing realistic recipes, although the results were also highly variable. To evaluate the quality of the synthesized recipes, each person on our team rated a set of 25 recipes on a scale from 1 to 10 using the following criteria: 1 = complete nonsense, 5 = makes some sense but clearly wrong, 10 = humanlike. The average score was 5.76, with a standard deviation of 2.27.

This was the highest rated synthetic recipe, with an average score of 8.8:

[INGREDIENTS]: steak, mushrooms, green onions, potatoes, spinach

[TITLE]:

[STEPS]:

1. Place steak , mushrooms , green onions , and potatoes in a large , heavy-bottomed pot with enough water to cover

2. Cover and bring to a boil

3. Cook , stirring occasionally , for 5 minutes or until tender

4. Stir in spinach

5. Cook and stir for 2–3 minutes

6. Serve with potatoes , of course

This was the lowest rated synthetic recipe, scoring 3.2 on average:

[INGREDIENTS]: salt, potato, sour cream, ground beef, paprika, mayonnaise, mushroom, ground cinnamon

[TITLE]:

[STEPS]:

1. Cook potatoes according to directions

2. After steak is done , mix the onion , garlic , and bell pepper together

3. Add curry powder , salt , and pepper to taste

4. Cook the onion and garlic over medium heat , stirring occasionally , until it is soft

5. Stir in the sausage , paprika , and mayonnaise , and let the mixture come to a boil

6. Add the mushroom , cinnamon , and mushroom mixture and bring to a boil , stirring constantly , until almost completely cooked

7. Add the bread crumbs and stir to combine

8. Add the flour , paprika , and mayonnaise and stir to combine

9. Season to taste with salt and pepper

10. Add the sour cream and stir to combine

11. Place the stuffed shells on a large platter and sprinkle the remaining bread crumbs over the top

12. Serve with the sauce

Clearly there is a large gap between the best and worst recipes, but even the worst ones produce somewhat intelligible text that is clearly a recipe. Given more time, we would have explored how to prevent GPT-2 from producing recipes with ingredients that weren’t in the list provided to it.

CONCLUSION

Our project explored applications of GANs and NLP in generating recipes and synthesizing food images from a list of ingredients. Viable ingredient lists were generated using a tabular generative adversarial network called CTGAN, which supports one-hot encoding and conditional generation. From this list of ingredients, generating a recipe procedure involved training a natural language processing model called GPT-2 to predict recipe steps using the provided ingredients. Last of all, we visualized the results using CookGAN, a model that generates a photorealistic meal image from a list of ingredients and steps.

Computational food analysis is a growing application of computer vision due to its wide-ranging health implications, such as diet management, intake logging, and methodical meal preparation. In the future, we would like to explore the functional similarity of ingredients and meal preference forecasting. Ultimately, future development on computational food analysis will depend on the growth of better models for extracting food-related information from various mediums, from textual recipe descriptions to meal photos and videos.

REFERENCES

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.

Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Modeling Tabular Data using Conditional GAN. NeurIPS.

Han, F., Guerrero, R., & Pavlovic, V. (2020). CookGAN: Meal Image Synthesis from Ingredients. WACV.