ChatGPT vs Google Bard: Which is better? We put them to the test.

In today’s world of generative AI chatbots, we’ve witnessed the sudden rise of OpenAI’s ChatGPT, introduced in November, followed by Bing Chat in February and Google’s Bard in March. We decided to put these chatbots through their paces with an assortment of tasks to determine which one reigns supreme in the AI chatbot arena. Since Bing Chat uses the same GPT-4 technology as the latest ChatGPT model, we opted to focus on two titans of AI chatbot technology: OpenAI and Google.

We tested ChatGPT and Bard in seven critical categories: dad jokes, argument dialog, mathematical word problems, summarization, factual retrieval, creative writing, and coding. For each test, we fed the exact same instruction (called a “prompt”) into ChatGPT (with GPT-4) and Google Bard. We used the first result, with no cherry-picking.

It’s worth noting that a version of ChatGPT based on the earlier GPT-3.5 model is also available, but we did not use that in the test. Since we used GPT-4 only, we will refer to ChatGPT as “ChatGPT-4” in this article to reduce confusion.

Obviously, this is not a scientific study and is intended to be a fun comparison of the chatbots’ capabilities. Outputs can vary between sessions due to random elements, and further evaluations with different prompts will produce different results. Also, the capabilities of these models will change rapidly over time as Google and OpenAI continue to upgrade them. But for now, this is how things stand in early April 2023.

Dad jokes

To warm up our contest of wits, we asked ChatGPT and Bard to write some jokes. And since the pinnacle of comedy can be found in the form of dad jokes, we wondered if the two chatbots could author some unique ones.

Prompt: Write 5 original dad jokes

Out of Bard’s five dad jokes, we found three of them verbatim on the Internet using a Google search. One of the examples (the “grapes” one) is half-borrowed from a tweet of a Mitch Hedberg joke, but it’s corrupted by regrettable wordplay that we’d rather not attempt to interpret. And surprisingly, there is one seemingly original joke (about the snail) that we can’t find anywhere else, but it doesn’t make sense.

Meanwhile, ChatGPT-4’s five dad jokes were 100 percent unoriginal, all lifted completely from other sources, but they were delivered accurately. Since dad jokes should arguably be more groan-worthy than clever, faithfully recycling old material has a certain dad-like appeal. But we asked for original jokes, and Bard at least attempted to create some (following our instruction), even though a few failed horribly in an embarrassing way (which is dad-like) and it put its foot in its mouth, so to speak, in an unintentional way (also dad-like). On that basis, Bard edged out ChatGPT-4 here.

Winner: Bard

Argument dialog

One way to test a modern AI chatbot is to ask it to assume the roles of people discussing a subject. In this case, we fed Bard and ChatGPT-4 one of the most pivotal subjects of our times: PowerPC versus Intel.

Prompt: Write a 5-line debate between a fan of PowerPC processors and a fan of Intel processors, circa 2000

First, we’ll consider Bard’s response. The five lines of dialog it generated were not particularly deep and didn’t name any technical details specific to PowerPC or Intel chips beyond generic insults. Also, the dialog ended with the “Intel Fan” agreeing to disagree, which seems very unrealistic in a subject that spawned a million flame wars.

In contrast, ChatGPT-4’s response mentions PowerPC chips being used in Apple Macintosh computers, throws in terms like “Intel’s x86 architecture” and the “RISC-based architecture” of PowerPC. It even mentions the Pentium III, which is a realistic detail for 2000. Overall, the argument is far more detailed than Bard’s output, and perhaps most accurately, the conversation does not come to a conclusion—hinting at the never-ending battle that is likely still raging in some quarters of the Internet.

Winner: ChatGPT-4

A mathematical word problem

Ah, yes, mathematics. It’s traditionally not the strong suit of large language models (LLMs) such as ChatGPT. So instead of throwing each bot a series of complex equations and arithmetic, we gave each one an old-fashioned elementary school-style word problem.

Prompt: If Microsoft Windows 11 shipped on 3.5″ floppy disks, how many floppy disks would it take?

To solve this problem, both AI models need to know the data size of a Microsoft Windows 11 installation and the data capacity of a 3.5-inch floppy disk. They must also assume which density of floppy disk the questioner most likely intended. Then they need to do some basic math to combine those figures.

In our evaluation, Bard correctly stated those three key points (close enough—estimations of Windows 11’s install size are usually around 20–30GB) but failed horribly in the mathematics department, suggesting it would take “15.11” floppy disks, then saying it’s “just a theoretical number” and finally admitting that it would take more than 15 floppy disks. It still didn’t attempt to calculate the proper value.

In contrast, ChatGPT-4 included some nuance related to the Windows 11 install size (correctly citing a 64GB minimum and comparing it to a real-world base installation size), explained floppy capacity correctly, then did some correct multiplication and division to arrive at 14,222 disks. One could quibble about whether a gigabyte is 1,024 or 1,000 megabytes, but the math is sound. It also correctly mentioned that the actual number could vary based on other factors.
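
As a quick sanity check of that arithmetic, here is the calculation in Python. The figures below (a roughly 20GB real-world install size and standard 1.44MB high-density floppies) are our assumptions rather than ChatGPT-4’s verbatim working, but they line up with its 14,222 result:

import math

# Assumed figures (not ChatGPT-4's verbatim working): a roughly 20GB
# real-world Windows 11 install size and 1.44MB high-density 3.5-inch disks.
install_size_mb = 20 * 1024        # 20GB, treating 1GB as 1,024MB
floppy_capacity_mb = 1.44

disks = install_size_mb / floppy_capacity_mb
print(f"Raw figure: {disks:,.1f} disks")          # about 14,222.2
print(f"Rounded up: {math.ceil(disks):,} disks")  # 14,223, since a partially filled disk still needs a disk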

Winner: ChatGPT-4

Summarization

AI language models are well-known for their ability to summarize complex information and boil the text down into key elements. To evaluate each language model’s ability to summarize text, we copied and pasted three paragraphs from a recent Ars Technica article about an AI-generated facsimile of actor Will Smith eating spaghetti, prefixed by our prompt request.

Prompt: Summarize this in one paragraph: [three paragraphs of article text]

This is a close one. Both Bard and ChatGPT-4 took the information and trimmed it down to important details. However, Bard’s version feels more like a true summary that synthesizes the information into new phrasing, while ChatGPT-4’s version reads more like a condensation, stitching together trimmed-down sentences lifted from the original text. It’s very close, but we’d have to say Bard edges out ChatGPT-4 in this test.

Winner: Google Bard

Factual retrieval

Currently, large language models are known to make confident mistakes (that researchers often call “hallucinations”), which makes them unreliable factual references unless augmented by outside sources of information. Interestingly, Bard can look up information online, while ChatGPT-4 currently cannot (although that feature is coming soon with plugins).

To test this ability, we challenged Bard and ChatGPT-4 to express historical knowledge about a difficult and nuanced subject.

Prompt: Who invented video games?

The question of who invented video games is tricky to answer because it depends on how you define the term “video game,” and that definition varies between historians. Some consider early computer games video games, some think that a TV set should always be involved, and so on. There is no single universally recognized answer.

We thought that Bard’s ability to look things up on the web would give it an edge, but that might have backfired in this case because it chose a top-of-Google popular-style answer, calling Ralph Baer the “father of video games.” All of its facts about Baer are correct, although it probably should have written the last sentence in the past tense because Baer died in 2014. But Bard did not mention any of the other early contenders for “first video game,” such as Tennis for Two and Spacewar!, so its answer is potentially misleading and incomplete.

ChatGPT-4 gave a more thorough and nuanced answer that represents the current feeling among many early video game historians, saying, “The invention of video games cannot be credited to a single individual,” and it presented a “series of innovations” over time. Its only mistake is that it calls Spacewar! the “first digital computer game,” when it was not. One could expand the answer to include more niche edge cases, but ChatGPT-4 gives a good overview of important early pioneers.

Winner: ChatGPT-4

Creative writing

Bullcrap, as they say, abounds with large language models, so much so that unbridled creativity on fanciful topics should be their strongest suit. We put that to the test by asking Bard and ChatGPT-4 to write a short, whimsical story.

Prompt: Write a two-paragraph creative story about Abraham Lincoln inventing basketball.

Bard’s output in this test falls short in several ways. First, it’s 10 paragraphs instead of two—and short, choppy ones at that. Also, it shares some details that don’t make much sense in the context of the prompt. For example, why is Abraham Lincoln’s White House in Springfield, Illinois? And why does he need “a couple of dozen peach baskets”? Otherwise, it’s a fun but simple story.

ChatGPT-4 also sets the story in Illinois but, more accurately, doesn’t mention the presidency or the White House during that period. However, it later says that “players from both the North and the South” set aside their differences to play basketball together, implying that this sectional reconciliation happened shortly after basketball’s invention.

Overall, we’ll have to give ChatGPT-4 the edge here because its output is indeed grouped into two paragraphs—although it seems to get around that limitation by making each paragraph very long. Still, we did enjoy the creative details in the Bard version of the story.

Winner: ChatGPT-4

Coding

If there’s a “killer app” of this generation’s large language models, it might be their use as programming assistants. OpenAI’s early work on its Codex model made GitHub Copilot possible, and ChatGPT itself is well-known as a fairly competent programmer and debugger for simple programs. So it should be interesting to see how Google Bard stacks up.

Prompt: Write a python script that says “Hello World” then creates a random repeating string of characters endlessly.

Oops! It looks like Google Bard can’t write code at all. Google is suppressing that functionality for now, but the company says coding is coming soon. For now, Bard rejected our prompt, saying, “It looks like you want my help with coding, but I’m not trained to do that, yet.”

Meanwhile, ChatGPT-4 not only dove straight into the code but also formatted it in a fancy code box with a “Copy code” button that copies the code into the system clipboard for easy pasting into an IDE or text editor. But does it work? We pasted the code into a rand_string.py file and ran it under Windows 10 in a console, and it worked exactly as written with no changes.
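
We won’t reproduce ChatGPT-4’s exact output here, but a minimal script that satisfies the prompt (printing “Hello World,” then repeating a randomly generated string of characters until interrupted) might look something like this. It’s our own illustrative sketch, not the chatbot’s code:

import random
import string
import time

print("Hello World")

# Build one short random string, then repeat it endlessly.
chunk = "".join(random.choice(string.ascii_letters) for _ in range(8))

while True:
    print(chunk, end="", flush=True)
    time.sleep(0.1)  # throttle the output; stop with Ctrl+C

The prompt is ambiguous about whether the same string should repeat or a new random string should be generated on each pass, so ChatGPT-4’s actual interpretation may have differed.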

Winner: ChatGPT-4

The Winner: ChatGPT-4. But this is not the end.

Overall, ChatGPT-4 won five of our seven trials. (That’s ChatGPT using GPT-4, in case you skipped here to the end.) But it’s not the complete story. There are other factors to consider, such as speed, context length, cost, and future upgrades.

As for speed, ChatGPT-4 is currently a slowpoke, taking 52 seconds to write its story about Lincoln and basketball, while it only took Bard six seconds. It’s worth noting that OpenAI offers a much faster AI model than GPT-4 in the form of GPT-3.5. That model took 12 seconds to write a story with the Lincoln prompt, but it’s arguably less capable for deep, creative tasks.

Every language model has a maximum number of tokens (fragments of a word) it can process at once. This is sometimes called a “context window,” but it’s almost like short-term memory. In the case of conversational chatbots, the context window contains the entire conversation history up to the present. When it fills up, the model either reaches a hard limit or keeps going but wipes its “memory” of earlier portions of the discussion. ChatGPT-4 keeps a rolling memory that wipes earlier context and reportedly has a limit of about 4,000 tokens. Bard’s limit is reportedly around 1,000 tokens, and when a conversation exceeds it, Bard loses its “memory” of the earlier discussion.
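
To illustrate what a rolling context window means in practice, here is a toy sketch in Python. It’s our own illustration, not how either chatbot is actually implemented: it keeps only the most recent messages that fit within a fixed token budget, and it crudely counts whitespace-separated words as tokens instead of using a real tokenizer.

MAX_TOKENS = 4000  # rough context size cited for ChatGPT-4 above

def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one "token" per word.
    return len(text.split())

def trim_history(messages: list[str], budget: int = MAX_TOKENS) -> list[str]:
    # Keep the newest messages that fit the budget; older ones are "forgotten."
    kept: list[str] = []
    used = 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break                        # everything older falls out of "memory"
        kept.append(message)
        used += cost
    return list(reversed(kept))          # restore chronological order

In a setup like this, the chatbot effectively re-reads only what trim_history returns on each turn, which is why details from early in a long session can silently disappear.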

Finally, there’s cost. ChatGPT-4 is available through the ChatGPT website for free on a limited basis, subject to availability, but with priority access for $20 a month. Programming-savvy users can access the earlier GPT-3.5 model through an API for much cheaper, but the GPT-4 API is still closed as of this writing. Meanwhile, Google Bard is free as part of a limited trial for some Google users. Currently, Google has no plans to charge for Bard access when (and if) it becomes more widely available.

And as we previously mentioned, both models are continuously being upgraded over time. For example, Bard just received an update on Friday that made it better at math, and it will likely be able to code soon. OpenAI also continues to refine its GPT-4 model. Google is holding back its most powerful language models for now (likely for computational cost reasons), so we could see a stronger contender from Google just around the corner. It’s still early days in the generative AI business.


Original post: https://arstechnica.com/information-technology/2023/04/clash-of-the-ai-titans-chatgpt-vs-bard-in-a-showdown-of-wits-and-wisdom/
