3,839
edits
Juho Kunsola (talk | contribs) (→Media perhaps about synthetic human-like fakes: moved factual events under == Timeline of synthetic human-like fakes ==) |
Juho Kunsola (talk | contribs) (merged === Timeline of digital sound-alikes === to == Timeline of synthetic human-like fakes ==) |
||
Line 91: | Line 91: | ||
Living people can defend<ref group="footnote" name="judiciary maybe not aware">Whether a suspect can defend against faked synthetic speech that sounds like him/her depends on how up-to-date the judiciary is. If no information and instructions about digital sound-alikes have been given to the judiciary, they likely will not believe the defense of denying that the recording is of the suspect's voice.</ref> themselves against digital sound-alike by denying the things the digital sound-alike says if they are presented to the target, but dead people cannot. Digital sound-alikes offer criminals new disinformation attack vectors and wreak havoc on provability. | Living people can defend<ref group="footnote" name="judiciary maybe not aware">Whether a suspect can defend against faked synthetic speech that sounds like him/her depends on how up-to-date the judiciary is. If no information and instructions about digital sound-alikes have been given to the judiciary, they likely will not believe the defense of denying that the recording is of the suspect's voice.</ref> themselves against digital sound-alike by denying the things the digital sound-alike says if they are presented to the target, but dead people cannot. Digital sound-alikes offer criminals new disinformation attack vectors and wreak havoc on provability. | ||
---- | |||
=== 'Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis' 2018 by Google Research (external transclusion) === | |||
* In ''' | * In the '''2018''' at the '''[[w:Conference on Neural Information Processing Systems|Conference on Neural Information Processing Systems]]''' (NeurIPS) the work [http://papers.nips.cc/paper/7700-transfer-learning-from-speaker-verification-to-multispeaker-text-to-speech-synthesis 'Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis'] ([https://arxiv.org/abs/1806.04558 at arXiv.org]) was presented. The pre-trained model is able to steal voices from a sample of only '''5 seconds''' with almost convincing results | ||
The Iframe below is transcluded from [https://google.github.io/tacotron/publications/speaker_adaptation/ 'Audio samples from "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis"' at google.gituhub.io], the audio samples of a sound-like-anyone machine presented as at the 2018 [[w:NeurIPS]] conference by Google researchers. | |||
Observe how good the "VCTK p240" system is at deceiving to think that it is a person that is doing the talking. | |||
{{#Widget:Iframe - Audio samples from Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis by Google Research}} | {{#Widget:Iframe - Audio samples from Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis by Google Research}} | ||
Line 114: | Line 107: | ||
[[File:Helsingin-Sanomat-2012-David-Martin-Howard-of-University-of-York-on-apporaching-digital-sound-alikes.jpg|left|thumb|338px|A picture of a cut-away titled "''Voice-terrorist could mimic a leader''" from a 2012 [[w:Helsingin Sanomat]] warning that the sound-like-anyone machines are approaching. Thank you to homie [https://pure.york.ac.uk/portal/en/researchers/david-martin-howard(ecfa9e9e-1290-464f-981a-0c70a534609e).html Prof. David Martin Howard] of the [[w:University of York]], UK and the anonymous editor for the heads-up.]] | [[File:Helsingin-Sanomat-2012-David-Martin-Howard-of-University-of-York-on-apporaching-digital-sound-alikes.jpg|left|thumb|338px|A picture of a cut-away titled "''Voice-terrorist could mimic a leader''" from a 2012 [[w:Helsingin Sanomat]] warning that the sound-like-anyone machines are approaching. Thank you to homie [https://pure.york.ac.uk/portal/en/researchers/david-martin-howard(ecfa9e9e-1290-464f-981a-0c70a534609e).html Prof. David Martin Howard] of the [[w:University of York]], UK and the anonymous editor for the heads-up.]] | ||
---- | ---- | ||
=== Possible legal response: Outlawing digital sound-alikes (transcluded) === | === Possible legal response: Outlawing digital sound-alikes (transcluded) === | ||
Line 174: | Line 166: | ||
* '''2016''' | science | '''[http://www.niessnerlab.org/projects/thies2016face.html 'Face2Face: Real-time Face Capture and Reenactment of RGB Videos' at Niessnerlab.org]''' A paper (with videos) on the semi-real-time 2D video manipulation with gesture forcing and lip sync forcing synthesis by Thies et al, Stanford. <font color="green">'''Relevancy: certain'''</font> | * '''2016''' | science | '''[http://www.niessnerlab.org/projects/thies2016face.html 'Face2Face: Real-time Face Capture and Reenactment of RGB Videos' at Niessnerlab.org]''' A paper (with videos) on the semi-real-time 2D video manipulation with gesture forcing and lip sync forcing synthesis by Thies et al, Stanford. <font color="green">'''Relevancy: certain'''</font> | ||
* '''2016''' | science / demonstration | [[w:DeepMind]]'s [[w:WaveNet]] owned by [[w:Google]] also demonstrated ability to steal people's voices | |||
* '''2017''' | science | '''[http://grail.cs.washington.edu/projects/AudioToObama/ 'Synthesizing Obama: Learning Lip Sync from Audio' at grail.cs.washington.edu]'''. In SIGGRAPH 2017 by Supasorn Suwajanakorn et al. of the [[w:University of Washington]] presented an audio driven digital look-alike of upper torso of Barack Obama. It was driven only by a voice track as source data for the animation after the training phase to acquire [[w:lip sync]] and wider facial information from [[w:training material]] consisting 2D videos with audio had been completed.<ref name="Suw2017">{{Citation | * '''2017''' | science | '''[http://grail.cs.washington.edu/projects/AudioToObama/ 'Synthesizing Obama: Learning Lip Sync from Audio' at grail.cs.washington.edu]'''. In SIGGRAPH 2017 by Supasorn Suwajanakorn et al. of the [[w:University of Washington]] presented an audio driven digital look-alike of upper torso of Barack Obama. It was driven only by a voice track as source data for the animation after the training phase to acquire [[w:lip sync]] and wider facial information from [[w:training material]] consisting 2D videos with audio had been completed.<ref name="Suw2017">{{Citation | ||
Line 189: | Line 181: | ||
| access-date = 2020-06-26 }} | | access-date = 2020-06-26 }} | ||
</ref> <font color="green">'''Relevancy: certain'''</font> | </ref> <font color="green">'''Relevancy: certain'''</font> | ||
[[File:Adobe Corporate Logo.png|thumb|right|300px|[[w:Adobe Inc.]]'s logo. We can thank Adobe for publicly demonstrating their sound-like-anyone-machine in '''2016''' before an implementation was sold to criminal organizations.]] | |||
* '''<font color="red">2018</font>''' | <font color="red">science</font> and demonstration | '''[[w:Adobe Inc.]]''' publicly demonstrates '''[[w:Adobe Voco]]''', a '''sound-like-anyone machine''' [https://www.youtube.com/watch?v=I3l4XLZ59iw '#VoCo. Adobe Audio Manipulator Sneak Peak with Jordan Peele | Adobe Creative Cloud' on Youtube]. THe original Adobe Voco required '''20 minutes''' of sample '''to thieve a voice'''. <font color="green">'''Relevancy: certain'''</font>. | * '''<font color="red">2018</font>''' | <font color="red">science</font> and demonstration | '''[[w:Adobe Inc.]]''' publicly demonstrates '''[[w:Adobe Voco]]''', a '''sound-like-anyone machine''' [https://www.youtube.com/watch?v=I3l4XLZ59iw '#VoCo. Adobe Audio Manipulator Sneak Peak with Jordan Peele | Adobe Creative Cloud' on Youtube]. THe original Adobe Voco required '''20 minutes''' of sample '''to thieve a voice'''. <font color="green">'''Relevancy: certain'''</font>. | ||
Line 195: | Line 189: | ||
* '''<font color="red">2018</font>''' | <font color="red">science</font> and <font color="red">demonstration</font> | The work [http://papers.nips.cc/paper/7700-transfer-learning-from-speaker-verification-to-multispeaker-text-to-speech-synthesis 'Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis'] ([https://arxiv.org/abs/1806.04558 at arXiv.org]) was presented at the 2018 [[w:Conference on Neural Information Processing Systems]] (NeurIPS). The pre-trained model is able to steal voices from a sample of only '''5 seconds''' with almost convincing results. | * '''<font color="red">2018</font>''' | <font color="red">science</font> and <font color="red">demonstration</font> | The work [http://papers.nips.cc/paper/7700-transfer-learning-from-speaker-verification-to-multispeaker-text-to-speech-synthesis 'Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis'] ([https://arxiv.org/abs/1806.04558 at arXiv.org]) was presented at the 2018 [[w:Conference on Neural Information Processing Systems]] (NeurIPS). The pre-trained model is able to steal voices from a sample of only '''5 seconds''' with almost convincing results. | ||
* [https://www.washingtonpost.com/technology/2019/09/04/an-artificial-intelligence-first-voice-mimicking-software-reportedly-used-major-theft/?noredirect=on 'An artificial-intelligence first: Voice-mimicking software reportedly used in a major theft'], a 2019 Washington Post article | |||
* '''2019''' | demonstration | '''[https://www.thispersondoesnotexist.com/ 'Thispersondoesnotexist.com']''' (since February 2019) by Philip Wang. It showcases a [[w:StyleGAN]] at the task of making an endless stream of pictures that look like no-one in particular, but are eerily human-like. <font color="green">'''Relevancy: certain'''</font> | * '''2019''' | demonstration | '''[https://www.thispersondoesnotexist.com/ 'Thispersondoesnotexist.com']''' (since February 2019) by Philip Wang. It showcases a [[w:StyleGAN]] at the task of making an endless stream of pictures that look like no-one in particular, but are eerily human-like. <font color="green">'''Relevancy: certain'''</font> |