Project Overview
Tanna AI required a feature that could deliver real-time video transcriptions by combining audio transcription with on-screen text extraction. The goal was seamless, real-time transcription of video content, covering both the audio track and any text shown on screen, without overlap. This was particularly beneficial for students using the Tanna AI platform to transcribe lectures and other educational videos.
Workflow Overview
The following diagram illustrates the end-to-end workflow for real-time video transcription.
Workflow Steps
01 Tesseract.js for On-Screen Text Extraction
The workflow begins with Tesseract.js, a JavaScript library for optical character recognition (OCR). It captures and processes screenshots taken from the video content, extracts any visible on-screen text, and then sends the extracted text to the Flutter-based media player for further processing.
02 Communication Between Tesseract.js and Flutter
Through JavaScript interop, Tesseract shares the extracted text directly with the Flutter media player. This ensures a smooth transfer of on-screen text data for display or synchronization within the app.
03 Google Cloud Functions Handling Audio Transcription
Simultaneously, the audio track from the video is processed by a Google Cloud Function written in Python. This function uses the OpenAI Whisper API to transcribe the spoken audio in real time and returns the transcript so that the audio and on-screen text streams stay in sync without overlap.
04 Storage and Real-Time Updates via Firebase Firestore
Both transcripts (audio and on-screen text) are saved in Firebase Firestore, providing a real-time database to store and sync transcription data so updates are immediately reflected on the user's device.
05 Synchronized Output on Flutter Media Player
Finally, the Flutter-based media player integrates both streams, the OCR text from Tesseract.js and the audio transcript from Whisper, and displays them accurately in real time without duplication or overlap.
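The deduplication in step 05 can be pictured as a merge of timestamped entries from both streams. The sketch below is illustrative only, not the production code: `merge_transcripts` and its `(timestamp, text)` tuple format are assumptions made for this example.

```python
def merge_transcripts(audio, ocr):
    """Merge two lists of (timestamp_sec, text) entries into one
    chronological transcript, dropping an entry whose normalized text
    repeats the previously kept entry (avoiding audio/OCR duplication)."""
    merged = sorted(audio + ocr, key=lambda entry: entry[0])
    out = []
    for ts, text in merged:
        norm = " ".join(text.lower().split())
        if out and " ".join(out[-1][1].lower().split()) == norm:
            continue  # same text already shown: skip the duplicate
        out.append((ts, text))
    return out
```

Normalizing whitespace and case before comparing makes the check tolerant of small OCR formatting differences while still catching the common case where the slide text and the spoken words are identical.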
Challenges
The primary challenge was synchronizing two separate data streams:
- Audio from the video, processed using OpenAI's Whisper API.
- Text present on-screen, extracted using Tesseract OCR.
Both streams needed to be processed simultaneously, without overlap, and with high accuracy, filtering out OCR gibberish in real time.
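One lightweight way to filter OCR gibberish in real time is a character-ratio heuristic. The sketch below is a hypothetical helper (`looks_like_text` is not part of the Tanna AI codebase): it keeps a string only if most of its characters are alphanumeric or whitespace and it contains at least one plausible word.

```python
import re

def looks_like_text(candidate, min_ratio=0.7, min_words=1):
    """Heuristic OCR-noise filter: accept a string only if enough of its
    characters are letters/digits/spaces and it contains at least one
    plausible word (two or more consecutive letters)."""
    if not candidate.strip():
        return False
    clean = sum(c.isalnum() or c.isspace() for c in candidate)
    if clean / len(candidate) < min_ratio:
        return False  # mostly symbols: likely OCR noise
    return len(re.findall(r"[A-Za-z]{2,}", candidate)) >= min_words
```

The thresholds are tunable: a higher `min_ratio` rejects more noise at the cost of occasionally dropping legitimate text such as code snippets on slides.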
Solution: Technical Architecture
01 Flutter Integration with JavaScript
I utilized Flutter's JS interop to bridge Flutter and JavaScript, allowing the app to send screenshots for text transcription and receive results back in real time.
import 'package:flutter/foundation.dart';
import 'package:js/js.dart';

class JSInterop {
  static void init() {
    // allowInterop wraps the Dart callback so JavaScript can invoke it.
    _shareScreenshotTranscript = allowInterop(_shareScreenshotTranscriptDart);
  }
}

// Binds the global JavaScript function `shareScreenshotTranscript(text)`
// to a Dart callback.
@JS('shareScreenshotTranscript')
external set _shareScreenshotTranscript(void Function(String transcript) f);

// Notifies listeners whenever new OCR text arrives from JavaScript.
ValueNotifier<String> shareScreenshotTranscript = ValueNotifier('');

void _shareScreenshotTranscriptDart(String transcript) {
  shareScreenshotTranscript.value = transcript;
}
02 JavaScript Text Extraction Using Tesseract
Tesseract was used for OCR-based text extraction, taking cropped screenshots from the video and extracting text in real time.
async function extractTextFromBlob(blob) {
  // Spin up a Tesseract.js worker with the English language model.
  const worker = await Tesseract.createWorker("eng");
  // Run OCR on the cropped screenshot blob (psm 1: automatic page
  // segmentation with orientation and script detection).
  const { data: { text } } = await worker.recognize(blob, { psm: 1 });
  await worker.terminate();
  // Hand the extracted text back to Flutter via the interop callback.
  shareScreenshotTranscript(text);
}
03 Real-Time Screenshot Processing
The app periodically captures screenshots and processes cropped regions of the video to improve transcription efficiency, minimizing unnecessary OCR operations.
void _cropAndTranscribeScreenshot(ByteBuffer screenshotBuffer) {
  // Wrap the raw screenshot bytes in a Blob the JavaScript side can read.
  final blob = html.Blob([screenshotBuffer]);
  // Hand the blob plus the selected crop region to JavaScript, which
  // crops, resizes, and runs OCR on it.
  js.context.callMethod('cropAndTranscribeImage', [
    blob,
    croppedRegionCoordinates?.x,
    croppedRegionCoordinates?.y,
    croppedRegionCoordinates?.width,
    croppedRegionCoordinates?.height,
    croppedRegionCoordinates?.resizeWidth,
    croppedRegionCoordinates?.resizeHeight,
  ]);
}
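One way to minimize unnecessary OCR operations, as described above, is to hash each cropped frame and skip OCR when the frame has not changed since the last capture. The Python sketch below is illustrative only (the production pipeline does this work in Dart and JavaScript, and `OcrScheduler` is a hypothetical name):

```python
import hashlib

class OcrScheduler:
    """Skip OCR when the cropped frame is byte-identical to the previous
    capture, using a content hash as the change detector."""

    def __init__(self):
        self._last_digest = None

    def should_run_ocr(self, cropped_frame: bytes) -> bool:
        digest = hashlib.sha256(cropped_frame).hexdigest()
        if digest == self._last_digest:
            return False  # identical frame: redundant OCR, skip it
        self._last_digest = digest
        return True
```

For lecture videos, where a slide can stay on screen for minutes, this kind of check avoids re-running OCR on every periodic screenshot of an unchanged slide.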
04 Audio Transcription Using Google Cloud Functions
A Google Cloud Function processes the audio extracted from the video using the Whisper model, keeping both streams (audio and on-screen text) synchronized in real time.
import os

def transcribe_video_file(json_request):
    # Locate the uploaded video in the Cloud Storage bucket.
    file_location = json_request.get('file_location')
    user_id = json_request.get('user_id')
    blob = bucket.blob(file_location)

    # Download to the writable /tmp directory of the Cloud Function,
    # preserving the original file extension.
    temp_video_file = '/tmp/temp_video' + os.path.splitext(file_location)[1]
    blob.download_to_filename(temp_video_file)

    # Strip the audio track and transcribe it (helper functions not shown).
    audio_data = extract_audio_from_video(temp_video_file)
    transcriptions = transcribe_audio(audio_data)
    transcript = transcriptions["results"]["channels"][0]["alternatives"][0]["transcript"]
    return {'message': 'File transcribed successfully.', 'transcript': transcript}
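The helpers `extract_audio_from_video` and `transcribe_audio` are elided above. For near-real-time behaviour, one option is to transcribe the audio in fixed-length, slightly overlapping chunks so partial transcripts can be returned before the whole file is processed. The boundary logic can be sketched as follows (`chunk_spans` is a hypothetical helper, not part of the original function):

```python
def chunk_spans(duration_sec, chunk_sec=30.0, overlap_sec=2.0):
    """Return (start, end) spans in seconds covering the full audio,
    each roughly chunk_sec long, with a small overlap between adjacent
    chunks so words at the boundaries are not cut off."""
    spans = []
    start = 0.0
    while start < duration_sec:
        end = min(start + chunk_sec, duration_sec)
        spans.append((start, end))
        if end >= duration_sec:
            break
        start = end - overlap_sec  # back up slightly to overlap chunks
    return spans
```

The overlap duplicates a couple of seconds of audio at each boundary; deduplicating repeated words in the stitched transcript is then the same problem solved for the audio/OCR merge elsewhere in the pipeline.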
Key Features
- Real-time transcription of both audio and on-screen text.
- Multi-stream processing (audio and video text) without overlap.
- JS-Interop integration for seamless interaction between Flutter and JavaScript.
- Cloud-based audio transcription using OpenAI's Whisper API.
- Screenshot cropping functionality for focused text extraction.
Challenges & Solutions
Challenge: Ensuring that Tesseract could accurately process screenshots of video frames and extract text in real time.
Solution: Cropping specific regions of the image significantly improved Tesseract's accuracy and minimized gibberish output. Extracted text was immediately sent back to Flutter via JS interop.
Challenge: Managing synchronization between the two streams (audio and on-screen text) without overlap.
Solution: The two streams were processed in parallel using cloud functions, and Firebase delivered real-time updates to the app to keep them in sync.
Impact
The real-time transcription feature significantly enhanced the accessibility of video content for students, resulting in positive feedback and additional sales for Tanna AI. The seamless integration of multiple technologies ensured high accuracy and strong user satisfaction.