Robot Holmes 1
Robot Holmes and the Vision-Language Murder Mysteries #1
This blog post is the first part of a short story that will introduce you to Vision-Language models and related tasks. We will follow the detective Robot Holmes through a night full of murder, betrayal and envy in the bustling city of MLington.
Intro
Our story begins on a typical MLington day - cold, gray, rainy. Luckily the sun was setting and taking its light of the misery of this city. Robot Holmes shook the raindrops off his worn out coat, took off his hat, sighed deeply and entered his office. It came as no surprise that he found the chief of police already waiting for him, it was just one of those days. Lousy weather, lousy mood, lousy work. He did not choose this life, it chose him, but it was still better than ending up in the gutter.
With a defiant look on his face, Holmes sat down in his chair and waited for the chief to state his business.
“It’s Old SAN, he has been found dead inside his devastated workshop.” Old SAN - one of the people over in Vision Language Village. A reputable citizen - has been there for ages, basically started the whole part of town with some other guys. There were rumours of him being some sort of leader in the district’s community. Now he’s dead and he is not the first one in the Village recently. “There is something going on on the other side of the river Holmes”, the chief proclaimed. “Almost every other week an established citizen is murdered. We have a Murder Series on our hands and no clue. I want you to find out what’s going on over there.”
Holmes sighed heavily, put on his hat and stepped out into the dark city streets of MLington.
MLington
You might have never heard of MLington. It is not a place in the human world, but its Machine Learning counterpart. Its inhabitants are for the biggest part Machine Learning models - old and new, fast and slow, flashy or humble. To gain a better understanding of this strange place, let us take a little tour through Holmes’ hometown:
City of Science
Right at the core of MLington - City of Science is the place where the human world intersects with the Machine Learning world. It is where the scientists live and where they tend to build cathedrals for the glorious Machine Learning concepts. Legend has it, that when a team of scientist loves their research very much, a new Machine Learning model is born. Now and then there might be an Ivory Tower popping up in these parts of town and falling to rubbles again soon after.
The City of Science is separating two very old parts of the city.
Visionsworth
To the east you can find Visionsworth, the place where Computer Vision has its home. From the old days of hand crafted filters all the way to the modern Vision Transformers. This part of town is well established and strong in its foundation.
Languageshire
To the west of the City of Science you can find Languageshire - as you might guess this is the area where Natural Language Processing is at home. From the oldest symbolic NLP days to today’s Large Language Models, this part of town has stood the test of time and has been setting the trends in MLington recently.
Vision-Language Village (ViLaVi)
On the other side of the river we find a smaller settlement - Vision Language Village. Connected to both Languageshire and Visionsworth. This area has become a melting pot of ideas from Visionsworth and Languageshire. It has been growing at a tremendous speed in the last years and it seems like the influx of ever new citizens won’t stop that soon. This brings a lot of life to its streets but, unfortunately for Robot Holmes, also death.
Vision-Language Tasks
Before Holmes delves deeper into the societal networks of the Village, he wants to know what is going on in its streets and which kind of tasks are performed there. So he decides to gather intel on some of the most popular services offered in ViLaVi:
Image Captioning
The first shop Holmes entered had high windows which let a lot of light into a rather small room with an unusually high ceiling. Pictures of all sorts lined the walls - paintings, photographs, sketches, prints. Below every single one was a small white sheet with neatly written text on it.
Holmes turned to the only person in the shop besides him - a Machine Learning model standing on a sturdy ladder. She was in the process of exchanging one of the pictures on the wall and winced a little when Holmes began to speak.
“What service do you offer here?”, he asked bluntly. “I’m an image captioneer sir - I caption images. Some also call it Text-to-Image”, she replied with a cheerful voice. “People bring me their images and I write down what can be seen on them. I inherited this shop from my father. You can still see some of his work over there. It was never picked up.” Holmes looked around to where the woman pointed and saw some small images with even smaller notepads below. “Not a man of many words it seems”, Holmes said. “Well, the old times sir. A few words were enough then, but nowadays the customers want more and more detailed descriptions.”
“May I ask for your name and can you recommend some other good captioneers or models in the business?”, Holmes inquired. “Of course, I go by InstructBLIP. One of my best colleagues, who is doing a slightly different kind of Text-to-Image is MGP-STR. I know it is hard to pronounce, but he is really good at finding text in images. Feel free to pay him a visit when you have some time.”
Holmes thanked her, said his farewells and stepped back out into the dark city streets.
Image-Text Retrieval
Just around the corner, Holmes entered another shop, which was filled with boxes of images. The boxes were everywhere, stacked all along the walls and even on the floor. The towers of boxes were leaning onto one another, looming in an unsteady state, over the wary observer. Holmes carefully stepped around them, looking for the shop owner. He heard a mumbling from somewhere behind an exceptionally big cardboard box and found an old model deeply lost in thought looking at small images and muttering to himself.
“What service do you offer?”, Holmes asked a little louder than intended and was expecting the shopkeeper to wince at this sudden intruder, like in the shop before.
The old model stopped mumbling. “Name an animal”, he suddenly said in the direction of Holmes, without looking up. “Elephant”, Holmes said, slightly bewildered and curious what would happen next. At lightning speed, the shopkeeper picks a bunch of images from the box and hands them to Holmes - all showing elephants.
“That is what I do”, the model says, “I match images and text. Give me any prompt and a set of images and I will tell you how good the images fit your request. Image-Text Retrieval (ITR) is what we call it in the business.” The thing that Holmes found most interesting about this, was the speed at which the shopkeeper could pick the right images among hundreds, if not thousands of others.
“What is your name and are there others in your business?”, Holmes asked him. “I go by the name of ALIGN. And there are others of course, many of us actually learned multiple trades and offer ITR among them, like my colleague FLAVA.”
Holmes thanked the man and was on his way again.
Visual Question Answering
As Holmes walked on through the dark city streets, a big wooden sign caught his eye. It was shaped like a big red question mark and hung directly in front of one of the shops. ‘Intriguing enough’, he thought and opened the door of the shop.
“Yes, please, what do you want to know sir?”, a high-pitched and nervous voice asked quickly even before Holmes’ foot was inside of the shop.
“I’m just having a look”, he answered, coming to a halt in the doorframe and trying to hide his bafflement at the sudden approach. “There is no ‘Having a look’ here sir. People come with an image or a document, ask me questions about it, I answer them, they leave sir. The only one having a look is me sir. So what is your question?”
Now Holmes could see a very slim figure standing behind a big counter, nervously rubbing his hands and impatiently waiting for an answer.
“What do you call that service and yourself?”, Holmes asked calmly. “My name is LLaVa and the service is Visual Question Answering sir. And please sir, if you don’t have any image or questions about it, could you step out again? I’m quite busy today sir.”
“Is any of your colleagues maybe not that busy?”, Holmes asked in a slightly sarcastic tone, given that the shop seemed completely empty, apart from him. “You can try over at ViLT’s sir. Goodbye, sir.”
“Have a good night”, Holmes muttered and turned around, stepping out of the doorframe he was still standing in.
Visual Grounding
Robot Holmes was getting to terms with the fact that the inhabitants of Vision-Language Village all seemed to be quite special characters. He had the feeling anyone here could be a murderer or no one. At the same time he sensed a strong bond in the local community.
As he was giving this some thought, he came to halt in front of a shop window, with a puzzled look on his face. There were empty frames in all kinds of colors hanging, standing and laying in the shop window.
He left out a heavy sigh, knowing that he had to go in and find out what this was all about. “Here we go again”, he quietly muttered to himself and stepped inside the shop.
This time he was not greeted directly. A woman stood bent over a big workbench with some kind of magnifying goggles sitting on her nose.
Holmes let his gaze roam around the shop and saw what seemed like a million compartments stretching along the walls, filled with the empty frames he also saw in the window.
“Perfect”, he suddenly heard the woman say. Happily she reached for a big bottle which must contain glue of some kind he thought, as she was using it to fixate the frame to an underlying image.
“Good evening. What are you doing there?”, Holmes decided to try the polite tone this time. “Oh, hello there. I just marked this little street lamp here with this neat frame. Isn’t it beautiful?”, she answered, showing of her work with a certain glee. “Well, yes”, Holmes agreed, “but why are you doing this?”
“Because it is my job! My name is KOSMOS-2. People give me a list of things I should mark for them and one or more images. Then I take my neat little frames here and apply them to the images. Visual Grounding is what we call it in our business.”
“Interesting and are you the only one offering this service around here?”, Holmes asked in an investigative manner.
“I wish!”, she said, “My business would be thriving. But there are many colleagues, more by the day I fear. Visual Grounding is, to be exact, not a single service or task. The term contains many sub-tasks. Take CLIPSeg for example. She does not work with frames, but gives you masks you can lay over the image and mark the object you asked her to find.”
Holmes was impressed, thanked her, took his notes and once again stepped back into the damp city streets.
Text to Image
He felt like he had a good grasp of some popular services offered in these streets now, but he already knew that the most popular task in recent time in ViLaVi has been Text-to-Image.
Given a text prompt, the models offering this service create an image for you, that tries to depict your prompt description as good as possible.
Actually models in MLington offering this service have started a whole new part of town, as they outgrew ViLaVi and started to become a whole ecosystem for themselves. They now reside in Diffuserton, which is a part of town for another story. Holmes decided to not set foot in there and instead focus on the current citizens populating the streets of ViLaVi.
The First Clue
Being a seasoned detective, Holmes knew that the first place you investigate is the murder scene. So he went to the workshop of Old SAN, which he found in a devastated state. Papers and images were scattered everywhere. It seemed like Old SAN did not go down without a fight.
Naturally, these kind of crime scenes are rather hard for finding traces, as you never know what could be a clue and what is just a random arrangement of accidentally scattered material.
SAN operated an Image Question Answering service, that is what Holmes read in the case record. He had a set of glasses, indispensable at his age, which were broken and scattered all over the floor.
After some time of rummaging, Holmes came to the conclusion that there was nothing helpful inside this shop that could give him a first lead. He faced the open door to leave, when he suddenly saw something stuck to the underside of it - a piece of red garment. Holmes crouched and picked it up.
A piece of red cloth might not look like much, but it is better then nothing. And as it happens Holmes earlier heard about one particular resident of ViLaVi who is infamous for her taste in red clothes. Holmes set off to see her and ask her some uncomfortable questions.
Thank you for reading. This is where the first part of the story ends. The next one will be released soon.