My children are not AI – #6 It’s hard to control everything they learn.

Friday, February 2, 2024

I recently caught my son swearing… but so have some generative AIs! In this sixth article, we’ll explore how hard it is to control the sources of what my children, and AI, learn, what the consequences are, and how to mitigate the adverse impacts.

One evening, during a casual conversation, my son suddenly came out with the French equivalent of the F-word. I was shocked to hear such a dirty word coming out of this cute and innocent mouth, without him even knowing what it meant. But my upset face must have looked funny, because it made him laugh and want to say it again and again. I asked him where he had learnt it and he just replied, “no one, I learnt it by myself”. We’ll never know where he heard it. Was it at school? From his sometimes foul-mouthed uncle? Or did it just slip out of my mouth on a bad day? Maybe all three, and many others, but this time it has stayed in his memory for good, and he knows it’s bad. Our kids are learning all the time, and as they grow up and interact more and more with society, it becomes harder to control the sources of what they learn.

Similar issues have happened with AI chatbots. The most famous case was Tay, Microsoft’s AI bot, which compared feminism to cancer and suggested the Holocaust did not happen after being “trained” by Twitter users for 16 hours in 2016. More recently, the chatbot of the parcel delivery company DPD swore at a customer and criticized its own company. Large Language Models (LLMs) are trained on enormous amounts of data, usually scraped from the internet, and this obviously includes swearing and other inappropriate content. These models are trained to give the most likely answer, so when prompted by a user they may also be foul and offensive. To avoid that, a good alternative for company support chatbots would be to use Small Language Models (SLMs). These models are trained on much smaller datasets, are more focused on their context, and are consequently cheaper. Controlling the source of the data and adjusting the model’s behaviour is less difficult than for an LLM.
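Whatever the model size, companies can also bolt a safety check onto the output before it ever reaches a customer. Here is a minimal sketch of that idea in Python; the blocklist, the generate_reply placeholder and the fallback message are all hypothetical, not DPD’s or anyone else’s actual implementation, and real systems use far richer classifiers than a word list.

```python
# Minimal sketch of an output guardrail for a support chatbot.
# All names and the blocklist are illustrative placeholders.
import re

BLOCKLIST = {"damn", "hell"}  # toy list; real systems use large, curated lexicons or classifiers
FALLBACK = "Sorry, I can't help with that. Let me connect you to a human agent."

def generate_reply(prompt: str) -> str:
    # Placeholder for a call to an LLM or SLM; hard-coded here for the example.
    return "Hell if I know where your parcel is."

def is_safe(text: str) -> bool:
    # Flag the reply if any word appears in the blocklist.
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words.isdisjoint(BLOCKLIST)

def answer(prompt: str) -> str:
    reply = generate_reply(prompt)
    return reply if is_safe(reply) else FALLBACK

if __name__ == "__main__":
    print(answer("Can you swear at me?"))  # prints the fallback message
```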

But there are more worrying things than AI learning curse words. In a 2023 study, the Stanford Internet Observatory found that LAION-5B, an open dataset used by Stable Diffusion, included hundreds of items of child sexual abuse material scraped from social media posts and popular adult websites. The researchers used Microsoft’s PhotoDNA, a hashing tool that runs the fingerprints of images against databases maintained by nonprofits that analyse reports of online child sexual exploitation and abuse. These images have since been removed from the dataset, but it had already been widely used to train models. Such images have been used to create AI-generated sexual abuse material and, besides, it is unclear exactly what the models have learnt from these images and their victims. Stable Diffusion’s developers have taken steps to add filters and intercept unsafe prompts and outputs, but this will not suffice. Moreover, many AI companies are not transparent about the source of their data and may still be training their models on this dataset.
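PhotoDNA itself is proprietary, but the general idea of fingerprint matching can be sketched with an ordinary perceptual hash: compute a compact fingerprint of each candidate image and compare it against a database of fingerprints of known-bad material, flagging anything within a small distance. The sketch below uses the open-source imagehash library purely as a stand-in; it is not PhotoDNA, and the example hash and threshold are arbitrary illustrative values.

```python
# Illustrative fingerprint matching with a generic perceptual hash.
# A stand-in for the idea behind tools like PhotoDNA, not the real thing.
from PIL import Image
import imagehash  # pip install imagehash pillow

# Hypothetical database of hashes of known-bad images, supplied by a trusted party.
KNOWN_BAD_HASHES = [imagehash.hex_to_hash("8f373714acfcf4d0")]

MAX_DISTANCE = 5  # arbitrary threshold in Hamming distance between fingerprints

def should_exclude(path: str) -> bool:
    """Return True if the image's fingerprint is close to any known-bad hash."""
    fingerprint = imagehash.phash(Image.open(path))
    return any(fingerprint - bad <= MAX_DISTANCE for bad in KNOWN_BAD_HASHES)

# A dataset curation step would then drop flagged files before training, e.g.:
# clean_paths = [p for p in candidate_paths if not should_exclude(p)]
```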

Are these models even lawful if they were trained on illegal data? This question is making many of these companies, and governments, very uncomfortable. Vetting all these images would be daunting and very expensive, hindering innovation and competitiveness. We are only just starting to see the tip of the iceberg of the harm this can do to society, and new examples appear daily. Injecting ethics into AI governance is increasingly critical, and regulation will need to follow to impose minimum safeguards, even if it won’t solve every problem.

My child has (hopefully!) never been exposed to illegal content, and as a parent, it is my duty to ensure he can only access content fit for his age on television and on the internet. But as he grows up and becomes more independent, he will hear and see more and more inappropriate content, and I won’t be able to filter everything out. Besides, he can’t stay sheltered from everything bad for the rest of his life. He will need to learn to identify what is inappropriate and understand why it is wrong, so that he does not spread, create or act on that kind of content. Can AI also grow up and become that smart one day, filtering out bad content on its own and producing only acceptable output? Only time will tell.
