# **Wav2Lip**: *Accurately Lip-syncing Videos In The Wild*

This code is part of the paper: _A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild_ published at ACM Multimedia 2020. 

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/a-lip-sync-expert-is-all-you-need-for-speech/lip-sync-on-lrs2)](https://paperswithcode.com/sota/lip-sync-on-lrs2?p=a-lip-sync-expert-is-all-you-need-for-speech)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/a-lip-sync-expert-is-all-you-need-for-speech/lip-sync-on-lrs3)](https://paperswithcode.com/sota/lip-sync-on-lrs3?p=a-lip-sync-expert-is-all-you-need-for-speech)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/a-lip-sync-expert-is-all-you-need-for-speech/lip-sync-on-lrw)](https://paperswithcode.com/sota/lip-sync-on-lrw?p=a-lip-sync-expert-is-all-you-need-for-speech)

[[Paper]](http://arxiv.org/abs/2008.10010) | [[Project Page]](http://cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild/) | [[Demo Video]](https://youtu.be/0fXaDCZNOJc) | [[Interactive Demo]](https://bhaasha.iiit.ac.in/lipsync) | [[Colab Notebook]](https://colab.research.google.com/drive/1tZpDWXz49W6wDcTprANRGLo2D_EbD5J8?usp=sharing) | [[ReSyncED] (coming soon)](#)

 <img src="https://drive.google.com/uc?export=view&id=1Wn0hPmpo4GRbCIJR8Tf20Akzdi1qjjG9"/>

----------
**Highlights**
----------
 - Lip-sync videos to any target speech with high accuracy. Try our [interactive demo](https://bhaasha.iiit.ac.in/lipsync).
 - Works for any identity, voice, and language. Also works for CGI faces and synthetic voices.
 - Complete training code, inference code, and pretrained models are available.
 - Or, quick-start with the Google Colab Notebook: [Link](https://colab.research.google.com/drive/1tZpDWXz49W6wDcTprANRGLo2D_EbD5J8?usp=sharing)
 - Several new, reliable evaluation benchmarks and metrics have been released [[`evaluation/` folder of this repo]](https://github.com/Rudrabha/Wav2Lip/tree/master/evaluation).
 - Code to calculate metrics reported in the paper is also made available.

--------
**Disclaimer**
--------
All results from this open-source code or our [demo website](https://bhaasha.iiit.ac.in/lipsync) should be used for research/academic/personal purposes only. As the models are trained on the <a href="http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html">LRS2 dataset</a>, any form of commercial use is strictly prohibited. Please contact us for all further queries.

Prerequisites
-------------
- `Python 3.5.2` (the code has been tested with this version at our end, but several users report that `3.6+` is the version that works for them instead.)
- ffmpeg: `sudo apt-get install ffmpeg`
- Install necessary packages using `pip install -r requirements.txt`
- Download the face detection [pre-trained model](https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth) to `face_detection/detection/sfd/s3fd.pth`. Use this alternative [link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/prajwal_k_research_iiit_ac_in/EZsy6qWuivtDnANIG73iHjIBjMSoojcIV0NULXV-yiuiIg?e=qTasa8) if the above does not work.
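
Before running anything, it can help to confirm that the prerequisites are in place. The following is a minimal sketch, not part of this repository: `check_prerequisites` is a hypothetical helper that only looks for `ffmpeg` on the `PATH` and for the face detection weights at the path listed above.

```python
import shutil
import sys
from pathlib import Path

# Path taken from the prerequisites list; relative to the repository root.
FACE_DETECTOR_PATH = Path("face_detection/detection/sfd/s3fd.pth")


def check_prerequisites(repo_root="."):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 5):
        problems.append("Python 3.5 or newer is expected")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    if not (Path(repo_root) / FACE_DETECTOR_PATH).is_file():
        problems.append(
            "missing face detection weights: {}".format(FACE_DETECTOR_PATH)
        )
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

This does not verify the packages from `requirements.txt`; it only covers the items that are easy to miss.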

Getting the weights
----------
| Model  | Description |  Link to the model | 
| :-------------: | :---------------: | :---------------: |
| Wav2Lip  | Highly accurate lip-sync | [Link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/Eb3LEzbfuKlJiR600lQWRxgBIY27JZg80f7V9jtMfbNDaQ?e=TBFBVW)  |
| Wav2Lip + GAN  | Slightly inferior lip-sync, but better visual quality | [Link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/EdjI7bZlgApMqsVoEUUXpLsBxqXbn5z8VTmoxp55YNDcIA?e=n9ljGW) |
| Expert Discriminator  | Weights of the expert discriminator | [Link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/EQRvmiZg-HRAjvI6zqN9eTEBP74KefynCwPWVmF57l-AYA?e=ZRPHKP) |

Lip-syncing videos using the pre-trained models (Inference)
-------
You can lip-sync any video to any audio:
```bash
python inference.py --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source> 
```
By default, the result is saved to `results/result_voice.mp4`. The output path can be set with an argument, as can several other options. The audio source can be any file supported by `FFMPEG` that contains audio data: `*.wav`, `*.mp3`, or even a video file, from which the code will automatically extract the audio.
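
If you want to call inference from another Python script, for example to process several clips in a loop, the command above can be assembled programmatically. This is only a sketch: `build_inference_command` and `lip_sync` are hypothetical helpers, and they use only the three flags shown in the command above. Any other option is passed through unchanged via `extra_args`.

```python
import subprocess
import sys


def build_inference_command(checkpoint_path, face, audio, extra_args=()):
    """Assemble the argument list for inference.py."""
    cmd = [
        sys.executable, "inference.py",
        "--checkpoint_path", str(checkpoint_path),
        "--face", str(face),
        "--audio", str(audio),
    ]
    # Optional flags are forwarded as-is; nothing is validated here.
    cmd.extend(str(arg) for arg in extra_args)
    return cmd


def lip_sync(checkpoint_path, face, audio, extra_args=()):
    """Run inference.py from the repository root and raise if it fails."""
    cmd = build_inference_command(checkpoint_path, face, audio, extra_args)
    subprocess.run(cmd, check=True)
```

For instance, `lip_sync("wav2lip.pth", "input.mp4", "speech.wav")` would run the same command as above; the three file names here are placeholders.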

##### Tips for better results:
- Experiment with the `--pads` argument to adjust the detected face bounding box. This often improves results. You might need to increase the bottom padding to include the chin region, e.g. `--pads 0 20 0 0`.
- If the mouth position looks dislocated or you see artifacts such as two mouths, the cause may be over-smoothing of the face detections. Use the `--nosmooth` argument and try again.
- Experiment with the `--resize_factor` argument to get a lower-resolution video. Why? The models are trained on faces that were at a lower resolution. You might get better, more visually pleasing results for 720p videos than for 1080p videos (in many cases, the latter works well too).
- The Wav2Lip model without GAN usually needs more experimentation with the above arguments to get the best results, and it can sometimes give you a better result as well.
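
To build intuition for the padding and resizing tips, here is a small illustrative sketch. It is not the repository's implementation. It assumes that the four `--pads` values are ordered (top, bottom, left, right), which is consistent with the bottom-padding example `--pads 0 20 0 0`, and that `--resize_factor` divides both frame dimensions by an integer. Both function names are hypothetical.

```python
def pad_box(box, pads, frame_width, frame_height):
    """Expand a face box (x1, y1, x2, y2) by pads, clamped to the frame.

    Assumption: pads is ordered (top, bottom, left, right).
    """
    x1, y1, x2, y2 = box
    top, bottom, left, right = pads
    return (
        max(0, x1 - left),
        max(0, y1 - top),
        min(frame_width, x2 + right),
        min(frame_height, y2 + bottom),
    )


def resized_resolution(width, height, resize_factor):
    """Frame size after dividing both sides by an integer resize factor."""
    return width // resize_factor, height // resize_factor


# Extending the box 20 pixels downwards to cover the chin region.
print(pad_box((100, 50, 200, 180), (0, 20, 0, 0), 1280, 720))  # (100, 50, 200, 200)
# Halving a 1080p frame.
print(resized_resolution(1920, 1080, 2))  # (960, 540)
```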

Preparing LRS2 for training
----------
Our models are trained on LRS2. See [here](#training-on-datasets-other-than-lrs2) for a few suggestions regarding training on other datasets.
##### LRS2 dataset folder structure

```
data_root (mvlrs_v1)
├── main, pretrain (we use only main folder in this work)
|	├── list of folders
|	│   ├── five-digit numbered video IDs ending with (.mp4)
```
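
As a quick sanity check of the layout above, the videos under `main` can be enumerated with a glob. This is a sketch under the assumption that each video sits exactly two levels below `data_root`, as drawn in the tree; `list_lrs2_videos` is a hypothetical helper.

```python
from pathlib import Path


def list_lrs2_videos(data_root):
    """Return sorted .mp4 paths found at data_root/main/<folder>/<video_id>.mp4."""
    return sorted(Path(data_root).glob("main/*/*.mp4"))


if __name__ == "__main__":
    videos = list_lrs2_videos("data_root")
    print("Found {} videos under main/".format(len(videos)))
```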

Place the LRS2 filelist `.txt` files (train, val, test) in the `filelists/` folder.

##### Preprocess the dataset for fast training

```bash
python preprocess.py --data_root data_root/main --preprocessed_root lrs2_preprocessed/
```
Additional options such as `batch_size` and the number of GPUs to use in parallel can also be set.

##### Preprocessed LRS2 folder structure
```
preprocessed_root (lrs2_preprocessed)
├── list of folders
|	├── Folders with five-digit numbered video IDs
|	│   ├── *.jpg
|	│   ├── audio.wav
```
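
After preprocessing, you may want to confirm that every video folder matches the structure above before starting a long training run. The sketch below assumes exactly that layout (frames as `*.jpg` plus one `audio.wav` per video folder); `find_incomplete_videos` is a hypothetical helper, not a script in this repo.

```python
from pathlib import Path


def find_incomplete_videos(preprocessed_root):
    """Return video folders that lack audio.wav or have no *.jpg frames."""
    incomplete = []
    for video_dir in sorted(Path(preprocessed_root).glob("*/*")):
        if not video_dir.is_dir():
            continue
        has_audio = (video_dir / "audio.wav").is_file()
        has_frames = any(video_dir.glob("*.jpg"))
        if not (has_audio and has_frames):
            incomplete.append(video_dir)
    return incomplete


if __name__ == "__main__":
    for video_dir in find_incomplete_videos("lrs2_preprocessed"):
        print("Incomplete:", video_dir)
```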

Train!
----------
There are two major steps: (i) Train the expert lip-sync discriminator, (ii) Train the Wav2Lip model(s).

##### Training the expert discriminator
You can download [the pre-trained weights](#getting-the-weights) if you want to skip this step. To train it:
```bash
python color_syncnet_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints>
```
##### Training the Wav2Lip models
You can either train the model without the additional visual quality discriminator (< 1 day of training) or with it (~2 days). For the former, run:
```bash
python wav2lip_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints> --syncnet_checkpoint_path <path_to_expert_disc_checkpoint>
```

To train with the visual quality discriminator, run `hq_wav2lip_train.py` instead. The arguments for the two scripts are similar, and in both cases you can resume training. See `python wav2lip_train.py --help` for more details. Additional, less commonly used hyper-parameters can be set at the bottom of the `hparams.py` file.
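
The two training steps can also be chained from a script. This sketch only builds the two commands shown above. It assumes that `hq_wav2lip_train.py` accepts the same three flags as `wav2lip_train.py`, which the text suggests but which you should confirm with `--help`. `training_commands` is a hypothetical helper, and the path to the expert checkpoint must be supplied by you once the first step has produced one.

```python
import sys


def training_commands(data_root, expert_dir, wav2lip_dir, expert_checkpoint,
                      use_visual_quality_disc=False):
    """Return [expert discriminator command, Wav2Lip command] as argument lists."""
    # Assumption: both Wav2Lip training scripts take the same flags.
    script = "hq_wav2lip_train.py" if use_visual_quality_disc else "wav2lip_train.py"
    expert_cmd = [
        sys.executable, "color_syncnet_train.py",
        "--data_root", data_root,
        "--checkpoint_dir", expert_dir,
    ]
    wav2lip_cmd = [
        sys.executable, script,
        "--data_root", data_root,
        "--checkpoint_dir", wav2lip_dir,
        "--syncnet_checkpoint_path", expert_checkpoint,
    ]
    return [expert_cmd, wav2lip_cmd]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order, running the second only after the first has saved a checkpoint.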

Training on datasets other than LRS2
------------------------------------
Training on other datasets might require modifications to the code. Please read the following before you raise an issue:

- You might not get good results by training/fine-tuning on a few minutes of a single speaker. This is a separate research problem, to which we do not have a solution yet. Thus, we would most likely not be able to resolve your issue. 
- You must train the expert discriminator for your own dataset before training Wav2Lip.
- If your dataset was downloaded from the web, it will in most cases need to be sync-corrected.
- Be mindful of the FPS of the videos of your dataset. Changes to FPS would need significant code changes. 

When raising an issue on this topic, please let us know that you are aware of all these points.
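
For the FPS point above, a quick way to audit a dataset is to query each video with `ffprobe`, which ships with ffmpeg. The following is a sketch, not part of this repository: the function names are hypothetical, and the expected frame rate is left as a parameter for you to set to match the data the code was written for.

```python
import subprocess
from fractions import Fraction


def parse_frame_rate(text):
    """Convert an ffprobe rate string such as '25/1' or '30000/1001' to a float."""
    first_line = text.strip().splitlines()[0]
    return float(Fraction(first_line))


def video_fps(path):
    """Ask ffprobe for the frame rate of the first video stream."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=r_frame_rate",
         "-of", "default=noprint_wrappers=1:nokey=1", str(path)],
        check=True, stdout=subprocess.PIPE, universal_newlines=True,
    )
    return parse_frame_rate(result.stdout)


def videos_with_unexpected_fps(paths, expected_fps, tolerance=0.01):
    """Return the paths whose frame rate differs from expected_fps."""
    return [p for p in paths if abs(video_fps(p) - expected_fps) > tolerance]
```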

Evaluation
----------
Will be updated.

License and Citation
----------
The software is licensed under the MIT License. Please cite the following paper if you use this code:
```
@misc{prajwal2020lip,
    title={A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild},
    author={K R Prajwal and Rudrabha Mukhopadhyay and Vinay Namboodiri and C V Jawahar},
    year={2020},
    eprint={2008.10010},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```


Acknowledgements
----------
Parts of the code structure are inspired by this [TTS repository](https://github.com/r9y9/deepvoice3_pytorch). We thank the author for this wonderful code. The code for Face Detection has been taken from the [face_alignment](https://github.com/1adrianb/face-alignment) repository. We thank the authors for releasing their code and models.