[2] Gemmeke, J. F., et al. (2017). AudioSet: An ontology and human-labeled dataset for audio events. ICASSP.
[6] McInnes, L., Healy, J., & Astels, S. (2017). HDBSCAN: Hierarchical density based clustering. JOSS.

Sample scene embeddings (t-SNE visualization) and a confusion matrix are available in the supplementary material.
[3] Rao, A., et al. (2020). SceneFormer: Inductive bias for video scene segmentation. ECCV.
: 87.4%; Macro F1-score: 0.85