This blog post was created for the purposes of entering the Gemini Live Agent Challenge hackathon on Devpost. #GeminiLiveAgentChallenge
The problem: AI assistants are trapped in browser tabs
We spend our entire day staring at screens, but every AI assistant lives inside a browser tab. To get help, you context-switch away from what you're doing, type out a description of what you're looking at, wait for a text response, then manually carry out whatever it said. It's friction at every step.
We wanted something different. An AI that's just there — floating on your desktop, watching what you see, listening when you speak, acting when you ask. No typing. No tab switching. No copy-pasting screenshots into a chat window.
That's SensAI.
What we built
SensAI is an Electron desktop app that uses Google's Gemini Live API for real-time voice conversation with screen vision. She lives on your screen as a floating animated orb and has over 20 computer-control tools — mouse, keyboard, file operations, shell commands, screen annotations, and more.
But the real differentiator is the MCP (Model Context Protocol) tool system. Instead of hardcoding a fixed set of capabilities, SensAI reads a config file, spawns MCP servers via stdio, discovers their tools at session start, and passes them to Gemini as function declarations. Plug in a GitHub server, a filesystem server, a Slack server — she can use anything you give her. This makes SensAI a platform, not just an app.
Architecture: how Gemini connects to everything
User (Voice + Screen)
│
▼
┌─────────────────────────────────────────────┐
│ Electron Desktop App │
│ │
│ Main Process ─→ Floating Orb ─→ MCP Client │
│ Offscreen Capture Personas Tool Router │
│ Local Memory Actions │
└──────────┬────────────────────────┬─────────┘
│ PCM Audio │
│ JPEG Frames │
▼ │
┌──────────────────────────┐ │
│ Google Cloud Platform │ │
│ │ │
│ Gemini Live API │ │
│ (voice+vision+tools) │ │
│ │ │
│ Firestore │ │
│ (memory, convos, stats) │ │
│ │ │
│ Google Search │ │
│ Code Execution │ │
└──────────┬───────────────┘ │
│ Function calls │
└────────────────────────┘
│
▼
┌───────────────────┐
│ MCP Servers │
│ (user-configured,│
│ any tool) │
└───────────────────┘
The data flow: the user speaks, audio is captured as 16 kHz mono PCM and sent to the Gemini Live API over WebSocket. Screen frames are captured in an offscreen renderer process (to avoid blocking the main thread) and sent as JPEG. Gemini responds with 24 kHz audio plus optional function calls. Function calls route through the MCP client to whichever server handles that tool, and the result goes back to Gemini.
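On the capture side, the Float32 samples coming out of Web Audio have to be converted to 16-bit little-endian PCM before they can be sent to the Live API. A minimal sketch of that conversion (the function name is illustrative, not from the SensAI source):

```javascript
// Convert Float32 samples (-1..1, as produced by Web Audio) to
// 16-bit little-endian PCM, the format the Live API expects for audio input.
function floatTo16BitPCM(float32Samples) {
  const buf = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buf);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i])); // clamp to valid range
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // little-endian
  }
  return new Uint8Array(buf);
}
```

The resulting bytes are then base64-encoded and sent over the WebSocket as a realtime audio chunk.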
Google AI and Cloud services we used
Gemini Live API (@google/genai SDK)
This is the brain. The live.connect() method gives us a bidirectional WebSocket with real-time audio streaming, vision (screen frames as JPEG), and function calling — all in one session. The barge-in support is what makes it feel like talking to a real person: you can interrupt SensAI mid-sentence and she immediately stops and listens.
We register 20+ tools as Gemini function declarations — built-in tools (mouse, keyboard, memory, files) plus whatever MCP tools are discovered at session start. Gemini decides which tools to call based on the conversation context and what it sees on screen.
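For context, the registration boils down to building one tools array that mixes native Gemini tools with function declarations. The shapes below follow the documented @google/genai Live API surface; the declaration shown is a placeholder, not SensAI's actual tool list:

```javascript
// Illustrative tools array: native tools alongside custom function
// declarations (built-in tools plus whatever MCP discovery found).
const tools = [
  { googleSearch: {} },   // native Google Search grounding
  { codeExecution: {} },  // native code execution
  {
    functionDeclarations: [
      {
        name: 'move_mouse',
        description: 'Move the cursor to absolute screen coordinates',
        parameters: {
          type: 'object',
          properties: { x: { type: 'number' }, y: { type: 'number' } },
          required: ['x', 'y'],
        },
      },
      // ...20+ built-in tools, with MCP-discovered tools merged in at session start
    ],
  },
];
```

This array is passed in the session config when opening the Live connection, so Gemini sees every capability, native and custom, through one interface.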
Google Cloud Firestore
Memory persistence was important — SensAI remembers your name, preferences, and context across sessions. We built a sync module (electron/firestore-sync.js) that pushes memory to Firestore on every save and pulls from cloud on session start. Conversations and session analytics (persona used, tools invoked, duration) are also archived.
The sync is fire-and-forget: if Firestore is unreachable, the app works perfectly offline with local JSON files. When it reconnects, cloud wins on merge conflicts so your data is always up to date.
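The conflict rule is simple enough to sketch in a few lines. mergeMemory here is a hypothetical helper that assumes memory is a flat key-to-entry map; it is not the actual firestore-sync.js code:

```javascript
// "Cloud wins" merge: start from the local snapshot, then let any key
// present in the cloud copy overwrite the local value. Keys that only
// exist locally (e.g. saved while offline) survive and sync up later.
function mergeMemory(local, cloud) {
  const merged = { ...local };
  for (const [key, entry] of Object.entries(cloud)) {
    merged[key] = entry; // cloud wins on any conflicting key
  }
  return merged;
}
```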
Google Search + Code Execution
These are native Gemini tools registered alongside our custom tools. Ask SensAI to search something and she'll use Google Search. Ask her a math question and she'll write and execute code. No extra setup needed — they come free with the Gemini session.
The hardest bugs we squashed
Screen capture was freezing the entire app
Our first implementation called desktopCapturer.getSources() every second on Electron's main process. Each call takes 100-400ms and blocks all IPC, window management, and audio routing. On a laptop with GPU switching, the app stuttered badly.
The fix: move screen capture to a hidden offscreen BrowserWindow. The capture renderer uses a MediaStream from getUserMedia with chromeMediaSource: 'desktop', grabs frames via canvas.drawImage(video) + toBlob(), and sends them back via IPC. The main thread now only receives a lightweight base64 string — zero blocking.
SensAI randomly stopped talking mid-sentence
This was the most frustrating bug. Chrome's AudioContext starts in a suspended state due to autoplay policy. Our code called resume() but didn't await it — then immediately scheduled audio chunks on the still-suspended context. The chunks were silently dropped when currentTime jumped forward after resume completed.
The fix: a queue-based audio scheduler. Incoming chunks go into an array. An async drainAudioQueue() function awaits resume() before scheduling anything. A re-entrancy guard prevents multiple drains from running simultaneously. Both flushPlayback() (user barge-in) and stopPlayback() clear the queue.
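The pattern generalizes beyond Web Audio. Here it is with the AudioContext reduced to anything exposing an async resume(), and the actual scheduling step injected as a callback; names are illustrative rather than lifted from the app:

```javascript
// Queue-based audio scheduler with a re-entrancy guard.
// In the real app, ctx is the AudioContext and schedule() places a
// chunk on the Web Audio timeline.
function createAudioQueue(ctx, schedule) {
  const queue = [];
  let draining = false; // re-entrancy guard

  async function drainAudioQueue() {
    if (draining) return;   // a drain is already in flight
    draining = true;
    try {
      await ctx.resume();   // MUST complete before scheduling anything
      while (queue.length > 0) schedule(queue.shift());
    } finally {
      draining = false;
    }
  }

  return {
    enqueue(chunk) { queue.push(chunk); drainAudioQueue(); },
    flush() { queue.length = 0; }, // barge-in: drop all pending audio
  };
}
```

Because the drain awaits resume() before touching the queue, a flush() that lands mid-resume empties the queue and nothing stale gets scheduled.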
Base64 encoding was blowing the call stack
The audio capture used btoa(String.fromCharCode(...bytes)) with a spread operator on an 8192-byte Uint8Array. The spread expands every byte into a separate function argument, which flirts with the engine's argument limit (a "Maximum call stack size exceeded" RangeError on larger buffers) and generates heavy GC pressure four times per second even when it succeeds. Replacing it with a simple for loop that builds the string incrementally was an immediate improvement.
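A minimal before/after sketch of the encoding fix (btoa is a global in the renderer and in Node 16+):

```javascript
// Before: spreads every byte as a separate argument to String.fromCharCode,
// risking a "Maximum call stack size exceeded" RangeError and heavy GC churn.
// const b64 = btoa(String.fromCharCode(...bytes));

// After: build the binary string with a plain loop, then encode once.
function bytesToBase64(bytes) {
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}
```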
The MCP integration: making SensAI a platform
The MCP client (electron/mcp-client.js) speaks JSON-RPC over stdio. On session start:
- Reads mcp_config.json for server definitions
- Spawns each server as a child process
- Sends the MCP handshake (initialize → tools/list)
- Converts discovered tool schemas to Gemini function declarations
- Merges them with built-in tools
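For a sense of what that config looks like, here is an illustrative file following the mcpServers convention used by common MCP clients; the exact schema SensAI reads, and the server packages and paths shown, are assumptions:

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "<your-token>" }
    },
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"]
    }
  }
}
```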
When Gemini calls a tool the MCP client doesn't recognize as built-in, it routes the call to the correct server via tools/call and returns the result. SensAI can even add new MCP servers to herself by voice — she writes to the config file and tells you to restart the session.
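The routing decision itself is small. A sketch, with builtinTools and mcpToolIndex as illustrative structures rather than the actual mcp-client.js internals:

```javascript
// Tool router: built-in tools run locally; anything else is forwarded to
// whichever MCP server advertised that tool name during discovery.
function createToolRouter(builtinTools, mcpToolIndex) {
  return async function route(name, args) {
    if (name in builtinTools) {
      return builtinTools[name](args); // local handler (mouse, keyboard, files...)
    }
    const server = mcpToolIndex[name];
    if (!server) throw new Error(`Unknown tool: ${name}`);
    // JSON-RPC request to the owning MCP server; its result goes back to Gemini.
    return server.call('tools/call', { name, arguments: args });
  };
}
```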
This is what makes SensAI fundamentally different from other desktop assistants. She doesn't have a fixed capability set. You decide what she can do.
What we learned
- Electron's desktopCapturer is a trap — always capture in a renderer process, never the main process. The performance difference is night and day.
- Always await AudioContext.resume() — scheduling audio on a suspended context silently drops chunks. This isn't documented well anywhere.
- Voice-first UX is a different paradigm — confirmation flows, barge-in handling, and audio state feedback matter more than pixel-perfect layouts.
- MCP is elegant but stdio has edge cases — partial JSON lines, server initialization timeouts, and lifecycle management require careful handling.
- Firestore sync should be fire-and-forget — never block the app waiting for cloud. Local-first, sync in the background, cloud wins on conflict.
Try it yourself
SensAI is open source. Clone the repo, add your Gemini API key, and start talking to your desktop.
git clone https://github.com/sophiasophia/sensai-gemini-hackathon.git
cd sensai-gemini-hackathon
npm install
npm start