Bot Series - I Tried Breaking ChatGPT’s 'New' Image Generator: Here’s What I Learned (Part 2)
Welcome back! In Part 1 of this subseries, I shared the first half of my deep-dive experiments testing the limits and capabilities of ChatGPT's latest image generator.
In this second and final installment, I'll focus on GPT's surprising strength in creative storytelling, its evolving ability to handle detailed image edits, and, most importantly, takeaways and usage tips drawn from all of my experiments.
Here's everything else I learned.
2. Creative Narratives – GPT’s "Almost Agentic" Strength
After exploring ChatGPT’s capabilities in procedural visualization, I wanted to challenge it differently: how well could GPT perform when given more creative freedom? Unlike the structured, logic-driven visuals discussed in my last post, this time, I asked GPT to generate a simple, illustrated story. I purposefully left details on narrative, pacing, and imagery entirely up to its discretion.
But before jumping into the story: I had noticed GPT struggling with character consistency in my previous back-extension image experiment (read about this in Part 1). So, I decided to test whether 'anchoring' a character would help maintain visual consistency across multiple panels. By 'anchoring', I mean providing GPT with a specific visual it can refer to while generating images. Thus, I gave GPT a photo of my 22-month-old niece and asked it to use it as inspiration for the character illustration. Here's the final character design GPT came up with based on her:
With our main character established, I then asked GPT to create a four-panel creative story around her, without any further instructions or narrative constraints. With that, GPT generated a short storybook within minutes:
Although simple, GPT intuitively structured the story following a classic narrative arc: setup, rising action, climax, and resolution. I thought it did a pretty good job of creating a coherent storyline, conveying emotions through distinct facial expressions, and maintaining character consistency across panels.
That said, maintaining visual consistency was likely easier here compared to the more complex, photorealistic imagery in the back-extension example. Not only did GPT have a clear character ‘anchor’ to reference for each panel, but the story’s simplified illustrative style probably made consistent character appearance more manageable. Still, GPT struggled with minor details like misplacing the girl's hairpin across panels, and required several iterations to eliminate text cut-offs completely.
But overall, the creative quality of this exercise felt different. For one thing, it was notably more sophisticated than the simple stylization tasks trending these days (e.g., turning humans into plush toys or applying Ghibli-style visuals). While fun and captivating, those tasks involve applying known stylistic formulas, whereas this story required GPT to invent narrative content, establish sequences, and produce appropriate visuals to match the storyline.
To create this kind of cohesive narrative, GPT internally had to address fundamental storytelling questions like “What’s the story about?”, “What happens next?”, and “How should it end?”. It likely drew on narrative patterns learned during training, predicted plausible events, and balanced emotional and logical rhythms. Moreover, the four-panel requirement probably forced GPT into creative decision-making: it needed to choose which events to depict, how to pace the narrative, and how to ensure clarity and coherence within tight constraints.
This particular experiment left me both fascinated and a bit unsettled. While still fundamentally an AI assistant, I felt that GPT was beginning to demonstrate behaviors resembling agent-like thinking.
To be clear, GPT doesn’t possess true agent autonomy, like self-correcting visual errors or proactively changing story details unprompted. But its ability to infer implicit intent (“this is a story so images need to be interdependent, not isolated visuals”), set internal mini-goals (“a story must have a complete narrative arc”), and plan within defined constraints (“the story must fit within four panels”) strongly hints at emergent behaviors in LLMs.
Importantly, GPT remains neither conscious nor sentient. However, its increasingly sophisticated pattern recognition and its ability to simulate structured reasoning so convincingly show just how rapidly generative AI is developing. When an AI begins to simulate reasoning, planning, and narrative coherence on its own, it invites the question: how far off is Artificial General Intelligence, really?
3. Beyond Image Generation — GPT’s Editing Capabilities and Limitations
Finally, I wanted to test GPT’s updated editing capabilities. For this experiment, I revisited a prompt I’d used last year, to evaluate how GPT’s capabilities have improved with the recent update.
Last summer, I helped my dad create an image with ChatGPT for a presentation he was preparing. I gave GPT a detailed prompt derived from his presentation text, centered on finding meaning through kindness and sharing by focusing more on others, across neighbors, borders, and religions, than on oneself. For convenience, I'll call this the ‘Sharing Together’ message. These were some of the images GPT generated pre-update:
Figure 1
Figure 2
Figure 3
Though prompt-dependent, pre-update images generally had a cinematic, painterly, storybook-like quality, often semi-realistic. GPT’s initial attempt (Figure 1) illustrated the ‘Sharing Together’ theme well conceptually, yet contained odd elements like people placed on rooftops and disproportionate scaling. Seeing this, I asked GPT to fix these unrealistic elements.
However, GPT’s next output was Figure 2, a completely different image. Yes, it solved the initial problems of rooftop placement and proportions, but it didn't maintain visual continuity with the first image. Regardless, I provided another instruction, asking it to show racial diversity among the people in the image. This resulted in Figure 3, again a completely different image, though this time with diverse racial representation.
What’s important to note here is a critical pre-update limitation: GPT’s inability to edit specific elements within a generated image. Every editing request essentially triggered an entirely new image creation, often significantly different from previous versions.
Now post-update, I was excited to see how GPT would handle the same “Sharing Together” prompt. Here's what it initially generated:
Off the bat, I noticed a lot of improvements and changes. First, GPT incorporated diversity naturally without me explicitly instructing it, unlike my experience last year. It also intuitively integrated relevant portions of the text into the visual to further highlight the core message of my prompt. And rather than defaulting to the cinematic/painterly style, it created the image in a flat, vector-illustration style, which is generally more approachable and editable.
Then, to test GPT’s editing capabilities, I asked it to depict the woman in yellow on the far right wearing a hijab, to further highlight religious diversity. GPT handled this request exceptionally well:
It was a pretty perfect edit - exactly what I had hoped for in precisely the right spot of the image. And notably, the rest of the image remained unchanged, which was a major improvement over pre-update experiences.
The next thing I asked GPT to change was the expression of the man in the green shirt on the far left, so that he smiles with his mouth closed. This time, however, while GPT successfully adjusted the man's expression, other unintended changes occurred as well:
Although the man's expression now matched my request, facial features of other figures subtly changed, and details like wrinkles on their clothing had disappeared.
So why did these additional changes occur?
GPT doesn't truly ‘edit’ images by modifying specific regions. Instead, it regenerates the entire image each time, embedding the requested edits into the new generation. Thus, while it does its best to replicate the same image with each regeneration, minor discrepancies accumulate over multiple edits and cause a gradual visual drift from the original appearance.
Most importantly, there seems to be a certain ‘regeneration threshold’: once the number of edit requests goes beyond a handful, GPT tends to generate a completely new interpretation of the image, much as it did pre-update. It also struggles with visual short-term memory, often producing a new stylistic interpretation if too much time passes between edits.
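For those who want to poke at this regenerate-on-every-edit behavior outside the ChatGPT UI, here's a minimal sketch using OpenAI's Images API with the gpt-image-1 model (the API counterpart of the new in-app generator, as I understand it). To be clear, this is my rough approximation of what an in-app edit amounts to, not a confirmed description of ChatGPT's internals, and the prompts and file names are purely illustrative:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def save_image(response, path):
    # gpt-image-1 returns image data base64-encoded
    with open(path, "wb") as f:
        f.write(base64.b64decode(response.data[0].b64_json))

# Initial generation
v1 = client.images.generate(
    model="gpt-image-1",
    prompt=(
        "Flat vector illustration: a diverse group of neighbors "
        "sharing food together in a sunny courtyard"
    ),
)
save_image(v1, "v1.png")

# An "edit" request: the model re-renders the whole scene with the change
# folded in, rather than patching one region. This is why small details
# elsewhere (faces, clothing wrinkles) can drift between versions.
v2 = client.images.edit(
    model="gpt-image-1",
    image=open("v1.png", "rb"),
    prompt=(
        "Same scene, but the woman in yellow on the far right "
        "now wears a hijab"
    ),
)
save_image(v2, "v2.png")
```

Chaining several such edit calls back-to-back is, in effect, what my iterative requests in ChatGPT were doing, which is why each round gave the drift a fresh chance to creep in.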
Big Ideas & Takeaways: Tips on Using GPT’s Latest Image Generator
Despite these limitations, it's clear that GPT’s image generation and editing capabilities have significantly improved post-update. Moreover, the latest GPT updates aren’t just incremental improvements in image generation - they mark a meaningful step towards integrated multimodal reasoning, where text is becoming an effective guide for generating useful visuals.
From my experiments, here are some key insights and practical tips I’ve gathered:
Creative Narrative Generation is easier for GPT than Procedural Visualization: GPT excels when given creative freedom rather than strict visual-logic constraints. It has also noticeably gotten better at incorporating and clearly rendering text within images, boosting its usability for creating infographics.
Allow GPT creative flexibility whenever possible, as fewer constraints generally produce better results.
Provide ultra-specific instructions for visuals requiring high precision, such as detailed human anatomy or body movements.
Multi-panel generation is possible, but still challenging: Although GPT has improved, achieving consistent detail across multiple sequential panels remains difficult. Remember that when ‘editing’, GPT isn’t editing one specific panel within a sequence, but regenerating the whole image (so all four panels), making each turn more prone to visual drift.
“Anchor” figures or subjects explicitly for better visual consistency - this reduces the likelihood of inconsistencies across multiple panels.
Separate first, combine later - for multi-panel visuals, it may work better to anchor key visual elements first, generate each panel separately, and then combine the panels into a sequence afterwards (see the sketch after this list).
Navigating current limitations: Default Bias, Short-Term Memory, Visual Drift, and Regeneration Thresholds.
Default Bias: GPT tends to default to common visual patterns learned during training. Counteract this by providing active negation and precise context control. Ambiguous prompts often yield common, predictable visuals.
Short-term Memory: GPT's short-term visual memory across edits is limited, especially if there’s a significant time gap between edits. Completing your image edits in one sitting ensures better consistency.
Visual Drift & Regeneration Threshold: GPT regenerates the entire image upon each edit request rather than editing specific regions. Accumulated edits can cause visual drift, eventually hitting a tipping point where the image resets drastically (e.g., changing cast, layout, or environment). Minimizing the number of iterations helps maintain visual consistency.
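To make the ‘separate first, combine later’ tip concrete, here's a small Pillow sketch that stitches individually generated panel files into one 2x2 storyboard. The helper and file names (panel1.png through panel4.png) are hypothetical, just one way to do the combining step yourself:

```python
from PIL import Image

# Hypothetical helper (my own, not something GPT provides): stitch
# separately generated panels into a single grid, so each panel can be
# generated and re-edited on its own without dragging the other three
# through another full regeneration.
def make_storyboard(panel_paths, cols=2, out_path="storyboard.png"):
    panels = [Image.open(p).convert("RGB") for p in panel_paths]
    w = min(p.width for p in panels)
    h = min(p.height for p in panels)
    panels = [p.resize((w, h)) for p in panels]  # normalize panel sizes
    rows = -(-len(panels) // cols)  # ceiling division
    board = Image.new("RGB", (cols * w, rows * h), "white")
    for i, panel in enumerate(panels):
        board.paste(panel, ((i % cols) * w, (i // cols) * h))
    board.save(out_path)

make_storyboard(["panel1.png", "panel2.png", "panel3.png", "panel4.png"])
```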
Still, image generation and editing capabilities have grown impressively:
Enhanced ability to modify characters, clothing, and expressions through targeted edits.
Markedly better text incorporation within visuals.
Improved capability to extract and visually express underlying themes from abstract or implicit prompts (e.g., automatically depicting racial diversity from the general message).
Greater visual logic and coherence (e.g., no more unrealistic placements like people on rooftops!).
Ultimately, frequent hands-on experimentation is key when it comes to understanding and mastering AI tools. While I’ll continue exploring and documenting these evolving capabilities, I’d also love to hear about your experiences and discoveries. How have you experimented with GPT’s new visual features?
While users may appear to be the experimenters, they are simultaneously providing GPT with data, shaped by their language habits and thought patterns, for it to learn from and refine.
GPT accumulates data through its responses, while humans reveal their modes of thinking through their questions. What's cool is that the process is a mutual interaction. In a way, it's a strange coexistence where both sides are simultaneously experimenters and subjects. What a scary GPT, eek!