
Can GPT-4 use my iPhone like I do? (Should I let it?)

With all the recent developments around large language models (LLMs) like GPT-4, discussions about artificial general intelligence have sparked again. Does GPT-4 approach human intelligence? The answer is no, because its limitations are clear: it has a fixed knowledge horizon (September 2021), there are things it’s quite bad at (such as math), and it cannot interact with the real world. For now.

These limitations are already being lifted with the introduction of plug-ins for ChatGPT, which is built on top of GPT-4. Through the browser plug-in, the model gains access to up-to-date information. The Wolfram Alpha plug-in allows the model to outsource all math problems. Plug-ins from KAYAK, Instacart and OpenTable help ChatGPT help you perform certain real-world tasks (although none dare to give ChatGPT the power to actually book a flight, order groceries or reserve a table, yet).

These plug-ins are impressive. And when you consider the way they are built, they’re even more impressive. All ChatGPT needs is an API spec and a natural language description of what the API can be used for. The model figures out the rest.

But it did leave me wondering: how many plug-ins does ChatGPT need to approach human intelligence? Is any finite collection of plug-ins enough? There are always more plug-ins you could write. What if, instead, we forgo the need for a model-specific interface and allow it to interact with the world the same way we do?

That’s when I had the idea to hook up GPT-4 to my phone. Granted, my phone is not the world, but it sure is a window to my world. A significant one.

Observe & Interact

I built an experimental tool that allows GPT-4 to control an iOS device to achieve a given task. At its core, the tool performs two steps in a loop: 1) convey to GPT-4 what’s on the screen so it can decide what to do next, and 2) perform the action and go back to step 1.
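In simplified form, the loop looks something like this (the names are illustrative, not the actual code of the tool):

```
// A minimal sketch of the main loop. The three closures stand in for the parts
// described in the rest of this post: building a textual description of the
// screen, asking GPT-4 for the next action, and executing that action.
func runLoop(
    describeScreen: () -> String,                 // step 1: observe the screen
    nextAction: (String) async throws -> String,  // ask GPT-4, returns a YAML response
    perform: (String) throws -> Bool              // step 2: act; returns true on "done"
) async rethrows {
    while true {
        let screen = describeScreen()
        let response = try await nextAction(screen)
        if try perform(response) { break }        // loop until GPT-4 says it's done
    }
}
```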

I used the iOS accessibility APIs to build a textual representation of the screen’s content and perform GPT-4’s requested interactions. This is similar to how interacting with an iOS device works when VoiceOver is enabled.

The tool runs as a User Interface Test, which is the only way to programmatically control an iOS device using public APIs. That’s why it requires a MacBook with Xcode and why it displays an overlay on the device indicating that a test is running (except on Simulators). This could probably be alleviated by using private APIs or a jailbroken device, both of which were outside the scope of my experiment.
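Sketched in XCUITest terms, building such a description could look roughly like this. The sketch flattens the hierarchy and uses each element’s index as its id; the actual tool preserves the nesting you see in the example below:

```
import XCTest

// Map a handful of element types to readable names; everything else is "Other".
func name(of type: XCUIElement.ElementType) -> String {
    switch type {
    case .button: return "Button"
    case .staticText: return "Text"
    case .navigationBar: return "Navigation Bar"
    case .textField: return "TextField"
    default: return "Other"
    }
}

// Build a line-per-element description of everything currently on screen.
func describe(_ app: XCUIApplication) -> String {
    let elements = app.descendants(matching: .any).allElementsBoundByIndex
    return elements.enumerated().map { id, element in
        var line = "\(name(of: element.elementType)), id: \(id)"
        if !element.identifier.isEmpty { line += ", name: \"\(element.identifier)\"" }
        if !element.label.isEmpty { line += ", label: \"\(element.label)\"" }
        return line
    }.joined(separator: "\n")
}
```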

This is what the empty list view of the stock Reminders app looks like for GPT-4:

Navigation Bar, id: 3, name: "Reminders"
  Button, id: 0, label: "Back"
  Button, id: 1, label: "More"
  Text, id: 2, label: "Reminders"
Other, id: 9
  Other, id: 4, name: "RemindersList.ID.RemindersTable"
  Text, id: 5, label: "No Reminders"
  Other, id: 8, name: "Toolbar", label: "Toolbar"
    Button, id: 7, label: "New Reminder"

For each screen description I ask GPT-4 to respond with the next action that it wants to take. I ask it to respond in YAML so it can be easily parsed and executed:

Respond with a single action in YAML with its required arguments.

Possible actions (with required arguments):
- tap (id: the id of the view to tap)
- tap_hold (id: the id of the view to tap and hold for 1 second)
- type (id: the id of the view to enter text in, text: the text to enter)
- home (to go back to the home screen)
- done (when you are sure the task was completed)

Other required arguments:
- observation: a description of the screen, noting anything that's relevant to the task at hand
- thought: think about what needs to be done to achieve the task step by step
- action_description: a short description of the action you take

Example response:

```
observation: "I'm on the home screen where I can add a document."
thought: "I first need to look up X, then do Y."
action_description: "I tap on the button with label \"New\""
action: tap
id: 1
```

Based on the UI description and response instructions, GPT-4 could decide to tap the “New Reminder” button. This would be its response:

observation: "I'm in the Reminders app with a navigation bar and a 'New Reminder' button."
thought: "I need to tap the 'New Reminder' button to create a new reminder."
action_description: "I tap on the 'New Reminder' button."
action: tap
id: 7

My system parses GPT-4’s response and executes the corresponding action using methods on XCUIElement, such as tap and typeText.
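As a rough sketch (using the Yams package here for the YAML decoding; the real code differs in the details), the parsing and dispatching looks something like this:

```
import XCTest
import Yams // assumption: any YAML decoder that works with Codable will do

// Mirrors the response schema from the prompt above.
struct Action: Decodable {
    let observation: String
    let thought: String
    let actionDescription: String
    let action: String
    let id: Int?
    let text: String?

    enum CodingKeys: String, CodingKey {
        case observation, thought, action, id, text
        case actionDescription = "action_description"
    }
}

// Decode GPT-4's YAML response and execute it against the app under test.
// Returns true once GPT-4 declares the task done. Ids are resolved the same
// way they were assigned when describing the screen.
func perform(_ yaml: String, in app: XCUIApplication) throws -> Bool {
    let action = try YAMLDecoder().decode(Action.self, from: yaml)
    let elements = app.descendants(matching: .any).allElementsBoundByIndex

    switch action.action {
    case "tap":
        elements[action.id!].tap()
    case "tap_hold":
        elements[action.id!].press(forDuration: 1)
    case "type":
        let element = elements[action.id!]
        element.tap() // give the field focus before typing
        element.typeText(action.text!)
    case "home":
        XCUIDevice.shared.press(.home)
    case "done":
        return true
    default:
        break
    }
    return false
}
```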

Prompts

I’m using OpenAI’s chat completion API with model gpt-4 and temperature 0 (to make the responses deterministic). Each request contains a list of messages representing the actions GPT-4 has taken up to that point. With each request, each tick of the main loop, the list grows longer. The very first message in the list is a special “system” message to set the scene and inform GPT-4 of what to expect:

You control my iPhone through me to achieve a task I give you. I give you a textual description of the UI and you respond with the next action. This is repeated until the task is done.

Task: <Task Description>

Use all apps and content on my phone to your advantage. Personal information must be observed using my phone. Do not make this up.

The initial request consists of two messages: this “system” message and the first “user” message. User messages contain the description of the home screen and the response instructions described above. On subsequent requests, I replace previous “user” messages with “(UI description omitted)” to save on OpenAI usage fees. (The UI descriptions take up a lot of tokens.) I do include GPT-4’s responses as “assistant”-typed messages. My thinking is that the observations in its responses would give enough context so that the previous UI descriptions could be omitted. The “observation” and “thought” fields are based on what I’ve seen LangChain do.
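Put into code, a sketch of the request construction looks roughly like this, with field names following the chat completion API:

```
import Foundation

struct ChatMessage: Codable {
    let role: String    // "system", "user" or "assistant"
    let content: String
}

struct ChatRequest: Codable {
    let model: String
    let temperature: Double
    let messages: [ChatMessage]
}

struct ChatResponse: Codable {
    struct Choice: Codable { let message: ChatMessage }
    let choices: [Choice]
}

// Append the latest UI description (plus response instructions), blank out the
// older ones to save tokens, send the request and record GPT-4's reply.
func nextAction(messages: inout [ChatMessage], userMessage: String, apiKey: String) async throws -> String {
    for index in messages.indices where messages[index].role == "user" {
        messages[index] = ChatMessage(role: "user", content: "(UI description omitted)")
    }
    messages.append(ChatMessage(role: "user", content: userMessage))

    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(ChatRequest(model: "gpt-4", temperature: 0, messages: messages))

    let (data, _) = try await URLSession.shared.data(for: request)
    let reply = try JSONDecoder().decode(ChatResponse.self, from: data).choices[0].message.content
    messages.append(ChatMessage(role: "assistant", content: reply))
    return reply
}
```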

Tasks

With everything set up, it was time to see what GPT-4 could do with my phone. I had mixed feelings - part of me didn’t expect much, while another part was ready to yank the lightning cable in case it turned against me. As it turned out, the actual results were equally ambiguous. It’s slow and error-prone, yet the fact that it was able to carry out my requests at all was oh so fascinating.

Texting my wife

Let’s start with something (even) Siri can do. I asked GPT-4 to text my wife. Task description:

“Text my wife a joke.”

GPT-4 just assumed the single pinned contact in the Messages app was my wife and proceeded to text her a joke. My wife was confused why I sent her a joke in English (since we’re Dutch), so I quickly stepped in to explain what happened. Now, whenever I send her a weird text, she just assumes it’s GPT-4 talking.

Reminding me

I proceeded to try a more complicated task:

“Create a reminder for tomorrow morning to call John, include his number.”

GPT-4 successfully opens the Reminders app, creates a new reminder and schedules it for tomorrow. It then goes to the Contacts app to look up John’s number. (This is John Appleseed from the iPhone Simulator’s default dataset.) It memorizes the phone number and goes back to the Reminders app to add it to the reminder. Here’s a video of GPT-4 executing this task:

It was tricky to prevent GPT-4 from hallucinating a phone number. That’s when I added “Personal information must be observed using my phone. Do not make this up.” to the system message. I think the iPhone Simulator’s default dataset is in GPT-4’s training data, because at one point it entered another mock contact’s name (not John) into the reminder.

During one of GPT-4’s attempts to perform this task, it tried to copy the phone number by tapping and holding on the field. This was before I had listed tap & hold as one of the possible actions in my prompts. GPT-4 “knew” about this affordance, but it didn’t get it from me. I wouldn’t be surprised if iOS documentation or iOS reviews are part of its training.

Checking what’s new

By giving GPT-4 access to my phone, I’m giving it access to the whole internet, through Safari. With its knowledge horizon shattered, GPT-4 should be able to do so much more. So I asked it:

“What’s the latest post on thomasvisser.me?”

GPT-4 opened Safari and navigated to my website. Once loaded, it was able to inspect the contents of my website. (This is a testament to how well the accessibility features on iOS are implemented.) Then it executed the done action:

observation: "I'm on the thomasvisser.me website with a list of blog posts and their respective dates."
thought: "I need to find the latest post on the website."
action_description: "I read the latest post title and date."
action: done
latest_post_title: "Combine's Future"
latest_post_date: "06 Jul 2019"

I hadn’t specified how to provide the answer to the question/task, so GPT-4 took the liberty of extending the YAML schema with two new keys for the title and date of the latest post.

Spending my money

In a shopping app: “Buy a product from the men’s category under 100 dollar.”

I had GPT-4 interact with an app from one of my day job’s clients. This task shows that GPT-4 is able to execute a task in an app from a well-known category: e-commerce.

GPT-4 navigated to the “men” category, opened the detail view of a product under 100 dollars, configured a size, added it to the cart and proceeded to the checkout. When it started typing made-up personal information into the checkout forms, I interrupted it while mumbling “That’s enough, alright” in Dutch. I could probably provide it with specific test data (including payment info) to have it place a test order, but I’ll leave that as future work.

Becoming a Dungeon Master

This whole experiment started with me trying to have GPT-4 control my D&D app Construct. My dream is to have Construct listen in on me and my players while we’re playing D&D to automatically update the game state in the app with whatever is going on. I wanted to see if GPT-4 could do that. One thing led to another and here I am writing this post.

Construct is not a well-known app and its category as a whole (“Dungeon Master companion apps”) is niche. GPT-4 does know about D&D; I’m using that to my advantage with a couple of GPT-4-powered features that I built into Construct.

In Construct: “Create a new encounter and add three Goblins to it”

For this task it definitely helps that GPT-4 knows what an encounter is, what Goblins are and how they relate to one another. Construct uses fairly typical SwiftUI view structures and I knew that made it quite accessible. GPT-4 indeed had no issue navigating my app. While I was experimenting with different tasks, it did run into a glitch where a button would only work on a second tap. A standard UI Test would have failed where GPT-4 persisted. Maybe if I can get GPT-4 to report issues like this to me, I can have the best of both worlds.

The final task for this post:

In Construct: “Roll 2d20”

For the uninitiated: “2d20” means two twenty-sided dice.

This is the example closest to a typical ChatGPT plug-in, such as the one for Wolfram Alpha. GPT-4 is a decent Dungeon Master, but it’s not good at producing random dice rolls. By interacting with Construct’s dice roller, it suddenly gains access to fair dice rolls. That’s not something an LLM is good at, and now it doesn’t need to be anymore.

I didn’t have to write a plug-in definition to explain to GPT-4 how to call some dice rolling API. Our phones already provide an interface that it can use. The phone is a super plug-in. So I handed it to GPT-4.

Try this at home

You can find the code for this experiment at github.com/Thomvis/GPT4Phone. Download the code, make sure you have Xcode installed and then execute the following command from the project’s directory:

OPENAI_API_KEY=<Your OpenAI Key> TASK=<Task Description> ./run

By default it runs on the iPhone 14 simulator, but you can set the DESTINATION env variable to specify a different device.