In this work, we explore emotional style transfer on audio. In particular, we study the MelGAN-VC architecture for transfer between various emotion pairs. The generated audio is then classified using an LSTM-based audio emotion classifier. We find that "sad" audio is generated better than "happy" or "anger" audio, which we attribute to people expressing sadness in similar ways.
Detecting suspicious activities in surveillance videos is a longstanding problem, and it in turn makes crimes harder to detect. The authors propose a novel approach for detecting and summarizing suspicious activities in surveillance videos. They also create ground-truth summaries for the UCF-Crime video dataset. Further, the authors test existing state-of-the-art Dense Video Captioning algorithms on a subset of this dataset and propose a model for this task that leverages Human-Object Interaction models for the visual features. They observe that this formulation of Dense Captioning outperforms earlier approaches by a significant margin. The authors also perform an ablation analysis of the dataset and the model and report their findings.