Project Overview
Tanna AI required a feature that could deliver real-time video transcriptions by combining audio transcription with on-screen text extraction. The goal was seamless, real-time transcription of video content, covering both the audio track and any text shown on screen, without overlap. This was particularly beneficial for students using the Tanna AI platform to transcribe lectures and other educational videos.
Workflow Overview
The following diagram illustrates the end-to-end workflow for real-time video transcription.
Workflow Steps
01 Tesseract.js for On-Screen Text Extraction
The workflow begins with Tesseract.js, a JavaScript library for optical character recognition (OCR). It captures and processes screenshots taken from the video content, extracts any visible on-screen text, and then sends the extracted text to the Flutter-based media player for further processing.
02 Communication Between Tesseract.js and Flutter
Through JavaScript interop, Tesseract shares the extracted text directly with the Flutter media player. This ensures a smooth transfer of on-screen text data for display or synchronization within the app.
03 Google Cloud Functions Handling Audio Transcription
Simultaneously, the audio track from the video is processed by a Google Cloud Function written in Python. This function uses the OpenAI Whisper API to transcribe the spoken audio in real time and returns the transcript so that the audio and on-screen text streams stay in sync without overlap.
04 Storage and Real-Time Updates via Firebase Firestore
Both transcripts (audio and on-screen text) are saved in Firebase Firestore, providing a real-time database to store and sync transcription data so updates are immediately reflected on the user's device.
05 Synchronized Output on Flutter Media Player
Finally, the Flutter-based media player integrates both streams, the OCR text from Tesseract.js and the audio transcript from Whisper, and displays them accurately in real time without duplication or overlap.
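The deduplication in step 05 can be pictured as a merge of timestamped entries from both streams. The sketch below is illustrative only, not the production code: `merge_transcripts` and its `(timestamp, text)` tuple format are assumptions made for this example.

```python
def merge_transcripts(audio, ocr):
    """Merge two lists of (timestamp_sec, text) entries into one
    chronological transcript, dropping an entry whose normalized text
    repeats the previously kept entry (avoiding audio/OCR duplication)."""
    merged = sorted(audio + ocr, key=lambda entry: entry[0])
    out = []
    for ts, text in merged:
        norm = " ".join(text.lower().split())
        if out and " ".join(out[-1][1].lower().split()) == norm:
            continue  # same text already shown: skip the duplicate
        out.append((ts, text))
    return out
```

Normalizing whitespace and case before comparing makes the check tolerant of small OCR formatting differences while still catching the common case where the slide text and the spoken words are identical.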
Challenges
The primary challenge was synchronizing two separate data streams:
- Audio from the video, processed using OpenAI's Whisper API.
- Text present on-screen, extracted using Tesseract OCR.
Both streams needed to be processed simultaneously, without overlap, and with high accuracy, filtering out OCR gibberish in real time.
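One lightweight way to filter OCR gibberish in real time is a character-ratio heuristic. The sketch below is a hypothetical helper (`looks_like_text` is not part of the Tanna AI codebase): it keeps a string only if most of its characters are alphanumeric or whitespace and it contains at least one plausible word.

```python
import re

def looks_like_text(candidate, min_ratio=0.7, min_words=1):
    """Heuristic OCR-noise filter: accept a string only if enough of its
    characters are letters/digits/spaces and it contains at least one
    plausible word (two or more consecutive letters)."""
    if not candidate.strip():
        return False
    clean = sum(c.isalnum() or c.isspace() for c in candidate)
    if clean / len(candidate) < min_ratio:
        return False  # mostly symbols: likely OCR noise
    return len(re.findall(r"[A-Za-z]{2,}", candidate)) >= min_words
```

The thresholds are tunable: a higher `min_ratio` rejects more noise at the cost of occasionally dropping legitimate text such as code snippets on slides.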
Solution: Technical Architecture
01 Flutter Integration with JavaScript
I utilized Flutter's JS interop to bridge Flutter and JavaScript, allowing the app to send screenshots for text transcription and receive results back in real time.
import 'package:flutter/foundation.dart';
import 'package:js/js.dart';

class JSInterop {
  static void init() {
    // allowInterop wraps the Dart callback so JavaScript can invoke it.
    _shareScreenshotTranscript = allowInterop(_shareScreenshotTranscriptDart);
  }
}

// Binds the global JavaScript function `shareScreenshotTranscript(text)`
// to a Dart callback.
@JS('shareScreenshotTranscript')
external set _shareScreenshotTranscript(void Function(String transcript) f);

// Notifies listeners whenever new OCR text arrives from JavaScript.
ValueNotifier<String> shareScreenshotTranscript = ValueNotifier('');

void _shareScreenshotTranscriptDart(String transcript) {
  shareScreenshotTranscript.value = transcript;
}
02 JavaScript Text Extraction Using Tesseract
Tesseract was used for OCR-based text extraction, taking cropped screenshots from the video and extracting text in real time.
async function extractTextFromBlob(blob) {
  // Spin up a Tesseract.js worker with the English language model.
  const worker = await Tesseract.createWorker("eng");
  // Run OCR on the cropped screenshot blob (psm 1: automatic page
  // segmentation with orientation and script detection).
  const { data: { text } } = await worker.recognize(blob, { psm: 1 });
  await worker.terminate();
  // Hand the extracted text back to Flutter via the interop callback.
  shareScreenshotTranscript(text);
}
03 Real-Time Screenshot Processing
The app periodically captures screenshots and processes cropped regions of the video to improve transcription efficiency, minimizing unnecessary OCR operations.
void _cropAndTranscribeScreenshot(ByteBuffer screenshotBuffer) {
  // Wrap the raw screenshot bytes in a Blob the JavaScript side can read.
  final blob = html.Blob([screenshotBuffer]);
  // Hand the blob plus the selected crop region to JavaScript, which
  // crops, resizes, and runs OCR on it.
  js.context.callMethod('cropAndTranscribeImage', [
    blob,
    croppedRegionCoordinates?.x,
    croppedRegionCoordinates?.y,
    croppedRegionCoordinates?.width,
    croppedRegionCoordinates?.height,
    croppedRegionCoordinates?.resizeWidth,
    croppedRegionCoordinates?.resizeHeight,
  ]);
}
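One way to minimize unnecessary OCR operations, as described above, is to hash each cropped frame and skip OCR when the frame has not changed since the last capture. The Python sketch below is illustrative only (the production pipeline does this work in Dart and JavaScript, and `OcrScheduler` is a hypothetical name):

```python
import hashlib

class OcrScheduler:
    """Skip OCR when the cropped frame is byte-identical to the previous
    capture, using a content hash as the change detector."""

    def __init__(self):
        self._last_digest = None

    def should_run_ocr(self, cropped_frame: bytes) -> bool:
        digest = hashlib.sha256(cropped_frame).hexdigest()
        if digest == self._last_digest:
            return False  # identical frame: redundant OCR, skip it
        self._last_digest = digest
        return True
```

For lecture videos, where a slide can stay on screen for minutes, this kind of check avoids re-running OCR on every periodic screenshot of an unchanged slide.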
04 Audio Transcription Using Google Cloud Functions
A Google Cloud Function processes the audio extracted from the video using the Whisper model, keeping both streams (audio and on-screen text) synchronized in real time.
import os

def transcribe_video_file(json_request):
    # Locate the uploaded video in the Cloud Storage bucket.
    file_location = json_request.get('file_location')
    user_id = json_request.get('user_id')
    blob = bucket.blob(file_location)

    # Download to the writable /tmp directory of the Cloud Function,
    # preserving the original file extension.
    temp_video_file = '/tmp/temp_video' + os.path.splitext(file_location)[1]
    blob.download_to_filename(temp_video_file)

    # Strip the audio track and transcribe it (helper functions not shown).
    audio_data = extract_audio_from_video(temp_video_file)
    transcriptions = transcribe_audio(audio_data)
    transcript = transcriptions["results"]["channels"][0]["alternatives"][0]["transcript"]
    return {'message': 'File transcribed successfully.', 'transcript': transcript}
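The helpers `extract_audio_from_video` and `transcribe_audio` are elided above. For near-real-time behaviour, one option is to transcribe the audio in fixed-length, slightly overlapping chunks so partial transcripts can be returned before the whole file is processed. The boundary logic can be sketched as follows (`chunk_spans` is a hypothetical helper, not part of the original function):

```python
def chunk_spans(duration_sec, chunk_sec=30.0, overlap_sec=2.0):
    """Return (start, end) spans in seconds covering the full audio,
    each roughly chunk_sec long, with a small overlap between adjacent
    chunks so words at the boundaries are not cut off."""
    spans = []
    start = 0.0
    while start < duration_sec:
        end = min(start + chunk_sec, duration_sec)
        spans.append((start, end))
        if end >= duration_sec:
            break
        start = end - overlap_sec  # back up slightly to overlap chunks
    return spans
```

The overlap duplicates a couple of seconds of audio at each boundary; deduplicating repeated words in the stitched transcript is then the same problem solved for the audio/OCR merge elsewhere in the pipeline.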
Key Features
- Real-time transcription of both audio and on-screen text.
- Multi-stream processing (audio and video text) without overlap.
- JS-Interop integration for seamless interaction between Flutter and JavaScript.
- Cloud-based audio transcription using OpenAI's Whisper API.
- Screenshot cropping functionality for focused text extraction.
Challenges & Solutions
Challenge: Ensuring that Tesseract could accurately process screenshots of video frames and extract text in real time.
Solution: Cropping specific regions of the image significantly improved Tesseract's accuracy and minimized gibberish output. Extracted text was immediately sent back to Flutter via JS interop.
Challenge: Managing synchronization between the two streams (audio and on-screen text) without overlap.
Solution: The two streams were processed in parallel using cloud functions, and Firebase delivered real-time updates to the app to keep them in sync.
Impact
The real-time transcription feature significantly enhanced the accessibility of video content for students, resulting in positive feedback and additional sales for Tanna AI. The seamless integration of multiple technologies ensured high accuracy and strong user satisfaction.