AI’s progress has hit a critical constraint: access to real-world data. While public datasets and web scraping powered AI’s early breakthroughs, today’s models demand proprietary data from hospitals, enterprises, studios, and controlled environments – data that has been locked away behind legal, technical, and governance barriers. This bottleneck affects every stage of AI development, from pre-training to evaluation, forcing model builders to rely on synthetic data that can’t fully replicate the complexity of human behavior and real-world conditions. Protege addresses this fundamental gap by creating a platform where data holders can license their proprietary datasets while maintaining privacy, IP protections, and compliance – enabling AI builders to access medical records, media content, audio conversations, motion capture data, and other hard-to-find data at scale. Working with data partners across healthcare, media, and motion capture, the company has aggregated access to billions of data points, including over 3B medical notes, 100M medical images, 500K+ hours of video content, and 500K+ hours of audio across 50+ languages. With its recent acquisition of Calliope Networks and partnerships spanning from the majority of the “Magnificent Seven” tech companies to hundreds of data providers, Protege is becoming the central infrastructure layer connecting proprietary data with AI development needs.
AlleyWatch sat down with Protege CEO and Co-Founder Bobby Samuels to learn more about the business, its future plans, recent funding round, and much, much more…
Who were your investors and how much did you raise?
Protege raised $30M in a Series A1 round led by Andreessen Horowitz (a16z). The financing extends the company’s $25M Series A from August 2025 and brings total funding to roughly $65M since Protege’s founding in 2024. The round also includes follow-on participation from existing investors such as Footwork, CRV, Bloomberg Beta, Flex Capital, Shaper Capital, and more.
Tell us about the product or service that Protege offers.
Protege is an AI data platform unlocking access to trusted, real-world data at scale. We’re transforming how the world’s real data powers AI, enabling people and institutions to contribute their data safely and shape intelligence built on integrity, expertise, and human purpose. We work with private data holders across healthcare, media, and other industries to license and curate high-quality datasets that AI builders need for training, evaluation, and benchmarking. Our role is to act as the connective tissue between these two sides, making it possible to unlock valuable data while preserving privacy, IP rights, and regulatory compliance.
At its core, Protege is about turning data that has historically been siloed, sensitive, or underutilized into a responsibly governed asset. We focus on real-world data across industries because that’s what ultimately determines how AI systems perform once they leave the lab and operate in real environments.
What inspired the start of Protege?
While AI models and compute have advanced rapidly, access to the right data has become a bottleneck. The vast majority of the world’s most valuable data, especially in regulated industries like healthcare, is not publicly available, and synthetic or manufactured data can’t fully replicate real-world complexity. Protege was born from the belief that AI’s next leap will come from unlocking real-world data that is ethically sourced, expert-curated, and shared on human terms.
My co-founders and I had spent years working in privacy-first data ecosystems, and we saw an opportunity to apply those lessons to AI. We believed there was a better path forward than scraping data from the internet – one that compensated data holders, respected privacy, and enabled AI builders to train systems that would actually work in the real world.
How is Protege different?
We’ve been built around licensed, real-world data from day one. When AI builders come to Protege, they’re looking for real-world data: the most authentic signal of how people and systems actually behave. This isn’t synthetic data created by AI or manufactured data created to simulate human behavior. Across every stage of the AI development lifecycle, from pre-training to post-training to fine-tuning to evaluation, AI builders need this data. They’re looking across modalities and industries: healthcare, video, audio, motion capture, gaming, manufacturing, life sciences, real estate, finance, education, and many more. Foundational, multi-modal model builders (including the majority of the Magnificent Seven) now work with us across multiple domains, along with dozens of other model builders.
We also focus on curation and fit-for-purpose datasets rather than only volume. As AI builders’ needs have matured, they’ve shifted from “more data” to “the right data,” and our platform is designed to meet that demand, whether it’s representative medical scenarios in healthcare, highly specific content in media, or up-to-date audio and motion capture needs. We unlock revenue for data providers as well, empowering data stewards to share their data assets safely and help AI learn responsibly, so that progress is both powerful and representative of the broader human population.
What market does Protege target and how big is it?
Protege sits at the intersection of AI development and proprietary data, serving both AI builders and data holders across multiple verticals, such as healthcare, media, motion capture, and more. Fundamentally, there are three bottlenecks to AI progress: compute, models, and data. There are already multiple companies in the first two categories worth billions, potentially trillions. There has yet to be a dominant player in the data needed for AI development, and that’s the gap Protege aims to fill.
As AI becomes more multimodal and more embedded in real-world workflows, demand for licensed, domain-specific data will only grow. We believe solving AI’s data access problem is a generational opportunity, and the market spans nearly every industry touched by AI.
What’s your business model?
We currently operate as a two-sided data platform for AI development, where AI builders purchase licensed datasets and data holders are compensated through structured agreements. We earn revenue for facilitating access and providing value-added services like curation and de-identification where appropriate. Over time, we have also expanded into benchmarks and evaluation datasets to support AI development across the full lifecycle, not just initial training.

How are you preparing for a potential economic slowdown?
In our industry, we’ve seen an acceleration in demand across the different verticals that we serve. In particular, we feel well-positioned to take advantage of not only the growing need for data for AI development but also the growing trend toward ethical data licensing for AI across industries.
This has the potential to give other companies, organizations, and rights holders in industries susceptible to economic slowdowns an additional revenue stream that didn’t previously exist. These are win-win situations where data rights holders can benefit from their existing assets, and we as a company are able to help package that data and connect data holders with AI builders actively seeking out these proprietary data sources. This helps insulate us from broader market conditions while also giving others opportunities beyond their existing business lines.
What was the funding process like?
Protege has been growing quickly, and we were seeing clear signals in the market that there was an opportunity to raise capital in a way that would meaningfully accelerate what we were already doing: expanding data partnerships, hiring thoughtfully, and staying flexible around potential strategic opportunities. a16z stood out as the right partner given their depth in data infrastructure, AI, and healthcare, as well as the long-term orientation they bring to company building.
This round gives us more opportunities to accelerate product development, significantly expand Protege’s data network into new domains and data formats, deepen partnerships with leading institutions, and scale the team and infrastructure required to deliver AI-ready and rights-protected access to real-world data. At the same time, we get to bring on a world-class partner who is deeply connected to the ecosystem in which we operate.
Having Daisy Wolf, Partner at a16z, invest in us was an important part of that decision, given that her experience in healthcare and data is highly aligned with where we’re going. The round moved quickly and included continued participation from our existing investors, which we see as a strong vote of confidence in both the business and the direction we’re heading.
What are the biggest challenges that you faced while raising capital?
A big factor that’s often overlooked is how we convey our vision for the world, and how we as a company fit into it, when the world is changing so quickly. This is especially true in the AI space, where new models are released what seems like every week, and innovation (and disruption) is happening left and right. So having a clear and crisp vision that we can communicate to investors is paramount to ensuring that we see eye-to-eye with them quickly. This helps investors develop conviction in our vision and mission, while also ensuring that we feel confident we’ve chosen the right partner for the long haul.
What factors about your business led your investors to write the check?
For years, the open internet powered rapid advances in AI, but that resource is now largely exhausted. Public datasets, such as Common Crawl, capture only a small slice of the web, while the vast majority of high-value data lives offline, inside hospitals, enterprises, studios, and other regulated or proprietary environments. The real bottleneck has shifted to accessing real-world data responsibly. Investors see Protege as essential infrastructure for that next phase, enabling licensed, privacy-preserving access to the data AI systems need to perform reliably in practice. In addition, folks noted the strength of the team, which draws on a variety of backgrounds ranging from healthcare data to media to tech startups and more.
What are the milestones you plan to achieve in the next six months?
In the next six months, Protege aims to expand its verticals beyond healthcare, audiovisual, and motion capture, with the goal of becoming a trusted source of licensed, real-world data across domains.
Beyond just training data, Protege plans to evolve its platform and infrastructure to support all stages of the AI model development cycle, such as pre-training, post-training, fine-tuning, evaluation and benchmarking, and inference, allowing for more advanced evaluation.
What advice can you offer companies in New York that do not have a fresh injection of capital in the bank?
Similar to previous eras, the one advantage that smaller companies and startups have that incumbents don’t is speed. In the age of AI, this is especially true – the cost of developing new products, testing new ideas, and reaching new partners at scale has never been lower. While this may cause traditional channels to become saturated, it also creates a world where it has never been easier for great ideas to reach the right audiences that care about what you are building.
As a result, leaning into the speed advantage is almost never a bad idea in the early stages. It increases the surface area of opportunities, while also creating more chances to discover new insights and pivot as necessary in the ever-changing landscape.
Where do you see the company going now over the near term?
Over the near term, Protege is focused on becoming the central platform for real-world, licensed data used in AI development across industries, while also being the leading voice in AI data best practices for model builders. We believe that human data that reflects human activity in the real world will continue to play a greater and greater part in AI development. We aim to be the trusted leader for this type of data in the broader AI ecosystem.
What’s your favorite winter destination in and around the city?
I’m a big fan of a new AI-powered karaoke studio called Beatbox. It’s a ton of fun and a great space. (Though full disclosure, my wife and her cofounder opened it up late last year.)
