Code for paper "MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge"

Abstract: Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo's data, we propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the simulation suite and knowledge bases (this https URL) to promote research towards the goal of generally capable embodied agents.

MineCLIP: Foundation Model for MineDojo

MineCLIP is a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Guided by the MineCLIP reward, our trained MineAgents are able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. MineCLIP is trained on the MineDojo video dataset.
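To make the idea of a learned reward concrete, here is a minimal sketch that turns a video embedding and text embeddings into a scalar reward. It assumes the embeddings have already been produced by MineCLIP's encoders. It scores the goal prompt against a set of negative prompts with a softmax and subtracts the chance level 1/N, so a clip that matches no prompt in particular earns nothing. The function names and the `temperature` parameter are hypothetical, and this is not the repository's reward code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def language_reward(video_emb, goal_emb, negative_embs, temperature=1.0):
    """Hypothetical learned reward: probability that the video matches the
    goal prompt rather than the negative prompts, minus the uniform baseline,
    clipped at zero."""
    prompts = [goal_emb] + list(negative_embs)
    logits = [cosine_similarity(video_emb, p) / temperature for p in prompts]
    # Numerically stable softmax over the prompt logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    p_goal = exps[0] / sum(exps)
    # Subtract chance level so an uninformative clip earns zero reward.
    return max(p_goal - 1.0 / len(prompts), 0.0)
```

With `video_emb=[1.0, 0.0]`, `goal_emb=[1.0, 0.0]` and a single negative `[0.0, 1.0]`, the goal wins the softmax and the reward is positive; a clip equally similar to both prompts gets zero.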

Specifically, the MineCLIP model is a contrastive video-language model that learns to correlate video snippets with natural language descriptions. It is multi-task by design because it is trained on open-vocabulary, diverse English transcripts. With the trained MineCLIP model in hand, we train language-conditioned MineAgents that take raw pixels as input and predict discrete control actions.
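The contrastive objective can be sketched with the standard CLIP-style symmetric InfoNCE loss: within a batch, each video snippet should score highest against its own transcript, and each transcript against its own snippet. This pure-Python version assumes precomputed embedding vectors. MineCLIP's actual encoders, temperature, and any loss variants are not specified here, so treat it as an illustration of the technique rather than the training code.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch in which video_embs[i] and
    text_embs[i] form the matching (positive) pair."""
    v = [_normalize(e) for e in video_embs]
    t = [_normalize(e) for e in text_embs]
    n = len(v)
    # Similarity matrix: rows index videos, columns index texts.
    sims = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
             for j in range(n)] for i in range(n)]
    # Video-to-text: each video should pick out its own transcript.
    v2t = -sum(_log_softmax(sims[i])[i] for i in range(n)) / n
    # Text-to-video: each transcript should pick out its own video.
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(_log_softmax(cols[j])[j] for j in range(n)) / n
    return (v2t + t2v) / 2
```

Correctly paired embeddings give a loss near zero, while a batch whose captions are shuffled relative to the videos gives a large loss, which is the signal that pulls matching pairs together in embedding space.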

In this repo, we provide model code for MineCLIP and MineAgent, env wrappers for dense reward shaping, and the sample envs used in our paper. This codebase is released under the MIT License.

Installation

MineCLIP requires Python ≥ 3.9 and has been tested on Ubuntu 20.04. Installing the MineCLIP codebase is as simple as:

pip install git+https://github.com/MineDojo/MineCLIP

MineCLIP

We provide implementations of two MineCLIP variants, mineclip_attn and mineclip_avg. The demo script below runs either variant:

python3 main/mineclip/run.py variant=attn|avg

Choose either attn or avg. If everything goes well, you should see "Inference successful" printed out.
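The two variant names suggest how per-frame features are aggregated into a single video embedding: averaging them (avg) or weighting them with attention (attn). The sketch below contrasts plain temporal average pooling with single-query dot-product attention pooling. It is an assumption-laden illustration: the real attn variant may use a full temporal transformer, and the `query` vector stands in for a hypothetical learned parameter.

```python
import math


def avg_pool(frame_embs):
    """Temporal average pooling: element-wise mean of per-frame embeddings."""
    n = len(frame_embs)
    return [sum(col) / n for col in zip(*frame_embs)]


def attn_pool(frame_embs, query):
    """Single-query scaled dot-product attention pooling over frames.
    `query` is a hypothetical stand-in for a learned parameter."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, f)) / scale for f in frame_embs]
    # Softmax over frames gives one attention weight per frame.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of frame embeddings, dimension by dimension.
    return [sum(w * f[d] for w, f in zip(weights, frame_embs))
            for d in range(len(query))]
```

With a zero query every frame receives equal weight, so attention pooling reduces to average pooling; a query aligned with one frame concentrates the weight on that frame.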

Pretrained weights are provided for both attn and avg. Run the demo below to load them:

python3 main/mineclip/load_ckpt.py variant=attn|avg ckpt.path=PATH/TO/DOWNLOADED/CKPT

Choose either attn or avg and specify the path to the downloaded weights. If everything goes well, you should see "Successfully loaded ckpt" printed out.

MineAgent

We provide an implementation of the MineAgent model. To run a MineAgent that takes a single-step observation and outputs a single-step action, run:

python3 main/mineagent/run_single.py

If everything goes well, you should see "Inference successful" printed out.
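MineAgent predicts discrete control from raw pixels. The final step of such a policy, turning per-head logits into an action, can be sketched as below for a multi-discrete action space with one categorical choice per head. The number and size of the heads are hypothetical and are not MineAgent's actual action space; the network that produces the logits is omitted.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(head_logits, greedy=False, rng=random):
    """Pick one index per action head from per-head logits (hypothetical
    multi-discrete head layout). Returns a list of ints, one per head."""
    action = []
    for logits in head_logits:
        probs = softmax(logits)
        if greedy:
            # Deterministic evaluation: take the most probable choice.
            action.append(max(range(len(probs)), key=probs.__getitem__))
        else:
            # Stochastic rollout: sample from the categorical distribution.
            action.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return action
```

Greedy selection is typical for evaluation, while sampling keeps exploration alive during on-policy training; passing a seeded `random.Random` as `rng` makes rollouts reproducible.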

To run a MineAgent demo with an environment in the loop, execute:

python3 main/mineagent/run_env_in_the_loop.py

If everything goes well, you should see a Minecraft client pop up and an agent fight a spider (although the agent currently uses a random policy).

Dense Reward Shaping

We provide the env wrappers for dense reward shaping used in our paper. Specifically, we provide dense reward shaping for tasks in the animal-zoo and mob-combat groups. The corresponding wrappers can be found here and here.

You can also find two sample env implementations, HuntCowDenseRewardEnv and CombatSpiderDenseRewardEnv, which correspond to the "Hunt Cow" and "Combat Spider" tasks in the paper.
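As an illustration of dense reward shaping, the sketch below wraps a gym-style environment and adds a bonus proportional to the damage dealt to the target on top of the sparse success reward. Everything here is hypothetical: the `target_health` observation field, the `damage_weight` parameter, and the toy `ToySpiderEnv` are assumptions for demonstration, and the shaping terms are not necessarily those used by the repository's animal-zoo or mob-combat wrappers.

```python
class DenseRewardWrapper:
    """Hypothetical gym-style wrapper that adds a shaping term to the
    environment's sparse reward. Assumes env.step returns
    (obs, reward, done, info) and obs has a "target_health" field."""

    def __init__(self, env, damage_weight=1.0):
        self.env = env
        self.damage_weight = damage_weight
        self._prev_health = None

    def reset(self):
        obs = self.env.reset()
        self._prev_health = obs["target_health"]
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        health = obs["target_health"]
        # Reward damage dealt to the target since the previous step.
        damage = max(self._prev_health - health, 0.0)
        self._prev_health = health
        return obs, reward + self.damage_weight * damage, done, info


class ToySpiderEnv:
    """Toy stand-in for a combat task: action 1 hits the spider for 2 HP."""

    def reset(self):
        self.health = 6.0
        return {"target_health": self.health}

    def step(self, action):
        if action == 1:
            self.health = max(self.health - 2.0, 0.0)
        done = self.health == 0.0
        reward = 1.0 if done else 0.0  # sparse success reward
        return {"target_health": self.health}, reward, done, {}
```

With `damage_weight=0.5`, a no-op step earns 0.0 and three consecutive hits earn 1.0, 1.0, and 2.0, the last including the sparse success reward. The shaping term gives the agent a learning signal long before the sparse reward ever fires.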

Paper and Citation

Our paper is posted on arXiv. If you find our work useful, please consider citing us!

@article{fan2022minedojo,
  title   = {MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge},
  author  = {Linxi Fan and Guanzhi Wang and Yunfan Jiang and Ajay Mandlekar and Yuncong Yang and Haoyi Zhu and Andrew Tang and De-An Huang and Yuke Zhu and Anima Anandkumar},
  year    = {2022},
  journal = {arXiv preprint arXiv: Arxiv-2206.08853}
}


Aug 17, 2022