Show HN: Understudy – Teach a desktop agent by demonstrating a task once

bayes-song 96 points 39 comments March 12, 2026

I built Understudy because a lot of real work still spans native desktop apps, browser tabs, terminals, and chat tools. Most current agents live in only one of those surfaces. Understudy is a local-first desktop agent runtime that can operate GUI apps, browsers, shell tools, files, and messaging in one session.

The part I'm most interested in feedback on is teach-by-demonstration: you do a task once, the agent records screen video + semantic events, extracts the intent rather than coordinates, and turns it into a reusable skill.

Demo video: https://www.youtube.com/watch?v=3d5cRGnlb_0

In the demo I teach it: Google Image search -> download a photo -> remove background in Pixelmator Pro -> export -> send via Telegram. Then I ask it to do the same for Elon Musk. The replay isn't a brittle macro: the published skill stores intent steps, route options, and GUI hints only as a fallback. In this example it can also prefer faster routes when they are available instead of repeating every GUI step.

Current state: macOS only. Layers 1-2 are working today; Layers 3-4 are partial and still early.

npm install -g @understudy-ai/understudy
understudy wizard

GitHub: https://github.com/understudy-ai/understudy

Happy to answer questions about the architecture, teach-by-demonstration, or the limits of the current implementation.
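To make the "intent steps, route options, GUI hints only as a fallback" idea concrete, here is a minimal TypeScript sketch of what such a published skill might look like. All type and function names here are hypothetical illustrations, not Understudy's actual schema: each step stores its intent plus alternative routes, and replay prefers the fastest available non-GUI route before falling back to the demonstrated GUI path.

```typescript
// Hypothetical sketch (NOT Understudy's real schema): a skill as intent-level
// steps, each with alternative routes and a GUI hint kept only as a fallback.

type RouteKind = "api" | "cli" | "gui";

interface Route {
  kind: RouteKind;
  estCostMs: number;   // rough expected latency of this route
  available: boolean;  // e.g. is the CLI tool installed, is the API reachable
}

interface GuiHint {
  app: string;         // which app the demonstration used
  elementRole: string; // semantic role of the target element, not coordinates
  label: string;
}

interface IntentStep {
  intent: string;      // what the step accomplishes, extracted from the demo
  routes: Route[];     // alternative ways to accomplish it
  guiHint?: GuiHint;   // fallback only: replay the demonstrated GUI path
}

// Prefer the fastest available non-GUI route; otherwise fall back to a
// GUI route (which would be driven by the recorded hint).
function pickRoute(step: IntentStep): Route {
  const candidates = step.routes
    .filter((r) => r.available)
    .sort((a, b) => a.estCostMs - b.estCostMs);
  const fast = candidates.find((r) => r.kind !== "gui");
  const choice = fast ?? candidates[0];
  if (!choice) throw new Error(`no available route for: ${step.intent}`);
  return choice;
}
```

Under this sketch, "download a photo" could resolve to a CLI route (say, a fetch command) when one is available, while "remove background in Pixelmator Pro" would fall back to replaying the demonstrated GUI steps via the semantic hint.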

Discussion Highlights (9 comments)

sukhdeepprashut

2026 and we still pretend to not understand how llms work huh

abraxas

One more tool targeting OSX only. That platform is overserved with desktop agents already while others are underserved, especially Linux.

jedreckoning

Cool idea. Good call doing a demo as well.

sethcronin

Cool idea -- the Claude Chrome extension has something like this implemented, but obviously it's restricted to the Chrome browser.

rybosworld

I have a hard time believing this is robust.

walthamstow

It's a really cool idea. Many desktop tasks are teachable like this. The look-click-look-click loop it used for sending the Telegram for Musk was pretty slow. How intelligent (and therefore slow) does a model have to be to handle this? What model was used for the demo video?

8note

Sounds a bit sketch? Learning to do a thing means handling the edge cases, and you can't exactly do that in one pass. When I've learned manual processes it's been at least 9 attempts: 3 watching, 3 doing with an expert watching, and 3 with the expert checking the result.

obsidianbases1

Nice work. I scanned through the code and found this file to be an interesting read https://github.com/understudy-ai/understudy/blob/main/packag...

skeledrew

Interested, and disappointed that it's macOS only. I started something similar a while back on Linux, but only got through level 1. I'll take some ideas from this and continue work on it now that it's on my mind again.
