
Case Study
At a Glance
AI inference and large-scale video delivery are exposing the limits of general-purpose CPUs and GPUs. As workloads scale, performance bottlenecks shift from raw compute to memory bandwidth, latency, and power efficiency. In AI, inference is splitting into distinct prefill and decode phases with very different hardware requirements. In video, hardware video encoding is increasingly replacing software encoders to meet throughput and real-time constraints. Across both domains, specialized silicon delivers substantially higher performance per watt and enables explicit workload routing, marking a structural shift away from monolithic compute toward purpose-built processors.
How Scale Is Breaking the One-Size-Fits-All Model
Modern computing is seeing a clear shift from general-purpose processors to specialized silicon for demanding workloads. A recent example is Nvidia’s $20 billion deal to license Groq’s technology – a move signaling that the era of one-size-fits-all GPUs for AI inference is ending. In a parallel trend, video streaming giants like Google and Meta are moving video encoding from software on CPUs to custom chips (e.g., Google’s Argos and Meta’s MSVP), while specialized vendors such as NETINT offer Video Processing Units (VPUs) off the shelf.
By late 2025, spending on AI inference (running models) surpassed training, marking an “Inference Flip” in the industry. Inference at scale introduced new priorities: ultra-low latency and maintaining model state (context). GPUs, which dominated with ~92% market share in AI, were built as general-purpose accelerators – but now specialized requirements are fragmenting inference workloads faster than GPUs can generalize. The four main drivers behind the breaking of the GPU-centric model are:
- Prefill vs. Decode Phases: Large AI models have two distinct inference stages. Prefill (prompt ingestion) performs heavy computation (large matrix multiplies) over the entire input to build a context – a task GPUs handle well because it is compute-bound. Decode/Generation is the token-by-token output stage, which is memory-bandwidth-bound: the model generates one token at a time and must rapidly fetch and update state for each one. This split means that no single chip can optimally handle both phases. NVIDIA itself is re-architecting its roadmap accordingly: the Vera Rubin inference chips divide into a “Rubin CPX” part for prefill and a Groq-derived high-speed part for decode.
- SRAM “Scratchpad” Advantage: Groq’s secret sauce is SRAM – static RAM embedded directly on the processor. SRAM offers blazingly fast, low-energy data access (accessing a bit in on-chip SRAM uses ~0.1 pJ, versus 20–100× more energy to access a bit from external DRAM). SRAM can serve as an ultra-fast scratchpad for computations, allowing AI agents to manipulate data without constantly shuffling to slower memory. The trade-off is between size and cost – SRAM is chip-area-intensive, so these architectures can’t have as much total memory as DRAM-based GPUs. SRAM-centric chips shine for smaller models (roughly ≤8 billion parameters), which happen to cover a large and growing segment of inference: distilled models, edge and interactive AI, robotics, voice assistants, etc.
- Portable Software Stacks (Threat to Ecosystem Lock-in): Another factor pushing specialization is the erosion of NVIDIA’s software moat. Historically, NVIDIA’s dominance was safeguarded by CUDA (its proprietary programming stack); getting high performance on anything other than NVIDIA GPUs was a nightmare. But companies like Anthropic have developed portable AI stacks – software layers that let the same model run on multiple types of accelerators (NVIDIA GPUs, Google TPUs, etc.). NVIDIA’s Groq move is in part a defensive response: by integrating an alternative architecture’s IP into its own lineup, NVIDIA can offer specialized inference performance without users leaving the CUDA ecosystem.
- Stateful AI Agents Need Memory (KV Cache): The rise of agentic AI (autonomous agents that carry on long, interactive tasks) introduced a new performance bottleneck: memory for state. These agents need to remember long histories of their actions and observations. In large language models, this short-term memory is stored as a KV (key-value) cache during inference. For advanced agents, the ratio of “tokens thought about vs. tokens output” can reach 100:1 – i.e. an agent might internally juggle 100 tokens of context for every 1 token it outputs. If that working set (the KV cache) doesn’t stay in fast memory, the model loses its “train of thought” and has to waste huge compute cycles recomputing context.
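The memory pressure described above can be made concrete with back-of-envelope arithmetic. The sketch below estimates the KV-cache footprint for an agent whose context has grown under a 100:1 think-to-output ratio; the model configuration (layer count, KV heads, head dimension) is a hypothetical 8B-class setup chosen for illustration, not any vendor's spec sheet.

```python
def kv_cache_bytes(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_val=2):
    """The KV cache stores one key and one value vector per token, per layer,
    per KV head (two tensors, hence the factor of 2); fp16 = 2 bytes per value."""
    return 2 * context_tokens * n_layers * n_kv_heads * head_dim * bytes_per_val

# Hypothetical 8B-class model config (illustrative assumption)
n_layers, n_kv_heads, head_dim = 32, 8, 128

# An agent that "thinks about" 100 tokens for every token it emits:
output_tokens = 1_000
working_set = kv_cache_bytes(100 * output_tokens, n_layers, n_kv_heads, head_dim)
print(f"KV cache for a 100k-token context: {working_set / 2**30:.1f} GiB")
```

Even this modest scenario yields a working set in the tens of gigabytes, far beyond what fits in on-chip SRAM, which is exactly why keeping (or losing) that state in fast memory dominates agent performance.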
2026 and beyond will be an era of “extreme specialization” in AI hardware.
The Nvidia/Groq deal is a prime example of how even the market leader is spending billions to acquire specialized tech, signaling that more niche accelerators (for specific contexts, model sizes, or latency needs) are the future. For AI practitioners, this means thinking in terms of routing workloads to the right accelerator.
Instead of asking “which GPU do we buy?”, winners will ask “where does each part of our model run, and why?” The specialized inference stack will involve different chips optimized for different roles – much like a team of specialists rather than a single jack-of-all-trades. This mirrors what we’re about to discuss in video encoding: splitting workloads and using purpose-built silicon to achieve substantial efficiency gains.
The logic that “one general processor can’t efficiently handle it all” is just as true in video streaming. Over the past few years, we’ve witnessed a shift from software encoding on CPU farms toward custom video encoding chips purpose-built for compression tasks. Companies like Google and Meta have designed custom silicon (Google’s Argos VCU and Meta’s MSVP, respectively) for video transcoding, while providers like NETINT offer plug-in ASIC cards (called VPUs).
High-Volume On-Demand Transcoding (Batch Processing at Scale)
The use case: This refers to platforms like YouTube or Facebook handling massive amounts of uploaded video. Every minute, hundreds of hours of video content are uploaded and must be encoded into many formats and resolutions for streaming (e.g., one 4K upload becomes 8–10 different streams from 144p up to 4K).
Traditionally, this encoding was done by software codecs running on fleets of x86 servers, or in some cases on general-purpose GPUs. As video traffic exploded, especially with higher resolutions and more efficient (but compute-intensive) codecs like H.265/HEVC and AV1, the scaling and cost became prohibitive with general-purpose solutions. This scenario is analogous to the “prefill” phase in AI – a heavy batch computation problem. The goal is to maximize throughput (videos processed per second) and reduce cost per video/bit, rather than immediate latency.
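The fan-out described above is usually expressed as an adaptive-bitrate (ABR) “ladder.” The sketch below defines a hypothetical ladder and counts the encode jobs one upload generates; the rungs and bitrates are illustrative values, not any platform’s actual settings.

```python
# Hypothetical ABR ladder: (height, fps, target kbps). Illustrative values only.
LADDER = [
    (2160, 60, 18000), (1440, 60, 10000), (1080, 60, 6000),
    (720, 30, 3000), (480, 30, 1200), (360, 30, 700),
    (240, 30, 400), (144, 30, 150),
]

def renditions_for(source_height):
    """One upload is transcoded into every ladder rung at or below its resolution."""
    return [r for r in LADDER if r[0] <= source_height]

jobs = renditions_for(2160)  # a single 4K upload
print(f"{len(jobs)} renditions, "
      f"{sum(kbps for _, _, kbps in jobs)} kbps of total output to encode")
```

Multiply this fan-out by hundreds of hours of uploads per minute and the appeal of a fixed-function encoder ASIC, which burns far less energy per rendition than a software encoder, becomes obvious.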
Specialized solution: Google developed the Argos VCU (Video Coding Unit), an ASIC dedicated to video encoding, first revealed in 2021 and improved in subsequent generations. By integrating custom encoder logic (e.g. H.264/VP9 encoders) directly into silicon, Argos delivers orders-of-magnitude better throughput and efficiency.
Google reported that Argos enabled 20× to 33× more efficient video processing than its previous CPU-based pipeline. In practical terms, a single Argos-enabled server can replace dozens of CPU servers – one analyst noted that replacing ten CPU-only transcode servers with a single ASIC-based (VPU) server slashes power and space needs dramatically.
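The consolidation claim above is easy to sanity-check with rough numbers. In the sketch below, the server wattages are assumptions for illustration, not measured figures for any specific hardware; only the 10:1 consolidation ratio comes from the text.

```python
# Back-of-envelope consolidation math. Wattages are illustrative assumptions.
cpu_server_watts = 500    # assumed dual-socket software-transcode server
asic_server_watts = 600   # assumed host server fitted with several VPU cards
servers_replaced = 10     # consolidation ratio cited above

power_before = servers_replaced * cpu_server_watts  # 5000 W across the fleet
power_after = asic_server_watts                     # 600 W in one chassis
print(f"Power: {power_before} W -> {power_after} W "
      f"(~{power_before / power_after:.0f}x reduction); "
      f"rack slots: {servers_replaced} -> 1")
```

Even with generous assumptions for the ASIC host’s draw, the fleet-level power and rack-space savings compound across thousands of servers.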
Argos cards (with multiple encoder cores and on-board memory) allowed YouTube to keep up with surging upload volumes (e.g., during the pandemic). This efficiency and capacity gain is akin to how specialized inference chips handle massive prompts better than a CPU could. Meta followed a similar path: in 2023, it introduced the Meta Scalable Video Processor (MSVP), a custom transcoder ASIC for its video workloads. By deploying these chips in-house, Meta and Google reduced their reliance on expensive, energy-hungry CPUs and gained an edge in supporting new formats.
Strategic angle: The move to custom ASICs for on-demand video was driven by cost and scale economics and enabled new capabilities. These hyperscalers essentially treated video encoding as a fixed, high-volume workload that justified custom silicon investment – much like large AI operators do for training chips.
The ROI came from huge efficiency gains (more videos processed per dollar and per joule) and from better user experience (supporting higher resolutions, new codecs, HDR, etc., which differentiate their platforms). It also gives them a measure of control over their stack: rather than waiting on Intel or GPU vendors to improve codec performance, they can implement what they need. And for those who can’t build in-house, companies like NETINT offer merchant ASIC cards (VPUs) so that any streaming service can obtain similar capabilities and efficiencies off the shelf. For batch transcoding, specialized video chips have become a competitive necessity at scale – echoing the AI inference trend that purpose-built hardware yields transformative efficiency gains rather than brute-forcing with more general CPUs.
Live Streaming and Interactive Video (Real-Time Encoding)
The use case: Live video – such as live game streams, broadcasts, video conferencing, and interactive video platforms – presents a different challenge. Here latency is paramount; the encoding must happen in real time (each frame is encoded on the fly and delivered in milliseconds).
In the AI analogy, this is like the “decode” phase or the instantaneous reasoning part – it’s all about keeping up with real-time data. Historically, live streaming at scale often used fast software encoders on CPUs (sacrificing compression efficiency for speed), or utilized GPUs’ hardware encoders (like Nvidia’s NVENC). Performing high-quality live encoding for thousands of streams is extremely resource-intensive.
Amazon’s Twitch – which handles millions of concurrent viewers – for years relied on conventional CPU-based encoding infrastructure. This limited how far it could push quality and advanced codecs: Twitch stuck with H.264 and enforced relatively low bitrate caps for a long time, limiting the quality viewers could receive. Twitch’s lack of a custom ASIC was cited as a reason it couldn’t economically match YouTube’s live streaming quality. Live encoding is a parallel-throughput problem: a platform might need to encode hundreds of streams simultaneously with low latency, so the workload is both throughput- and memory-intensive (many frames from many streams are in flight at once, moving large amounts of data).
Specialized solution: The answer has been the rise of dedicated Video Processing Units (VPUs) and hardware transcoders for live encoding. For the broader market, companies like NETINT have developed ASIC-based transcoder cards that slot into servers to accelerate live workloads. NETINT’s latest Quadra VPU series can encode up to 16 concurrent 1080p60 streams (or multiple 4K streams) in real time, supporting H.264, HEVC, and even the very demanding AV1 codec – all while drawing just 17 watts and delivering roughly 20× higher throughput per watt than a CPU doing those encodes in software.
That kind of efficiency mirrors the Groq-vs-GPU story in AI, where specialized silicon massively outperforms the general processor for the “instantaneous, streaming” task. The impact is clear – a streaming provider can replace a fleet of CPU servers (or avoid renting thousands of cloud CPU instances) with a few dozen VPUs, reducing latency and cost.
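The efficiency gap can be expressed as streams per watt. In the sketch below, the VPU figures (16× 1080p60 at ~17 W) are the ones quoted above; the CPU baseline is an assumption chosen to be consistent with the ~20× per-watt claim, not a benchmark result.

```python
# Streams-per-watt comparison. VPU figures are quoted in the text;
# the CPU baseline is an illustrative assumption consistent with ~20x.
vpu_streams, vpu_watts = 16, 17    # 16x 1080p60 live encodes at ~17 W
cpu_streams, cpu_watts = 14, 300   # assumed software encodes on one CPU server

vpu_eff = vpu_streams / vpu_watts  # ~0.94 streams/W
cpu_eff = cpu_streams / cpu_watts  # ~0.047 streams/W
print(f"VPU: {vpu_eff:.2f} streams/W, CPU: {cpu_eff:.3f} streams/W, "
      f"ratio ~{vpu_eff / cpu_eff:.0f}x")
```

Per-watt ratios like this matter more than raw throughput once a data center hits its power ceiling, which is exactly the constraint the next section turns to.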
Strategic angle: For live video platforms, specialized encoders are becoming critical for both performance and economics. Viewers expect increasingly higher quality live streams (1080p60, 4K, maybe VR streams) without buffering – but simply throwing more CPU instances at the problem is extremely expensive and eventually hits power limits. Dedicated video ASICs change that equation, enabling higher quality and new formats (like low-latency AV1 live) at a fraction of the operating cost.
VPUs are a competitive differentiator: YouTube, for example, could offer 4K live streaming and better compression earlier, while Twitch faced prohibitive costs to do the same. Just as in AI inference, if you don’t have the specialized tool, you are forced to dial back performance; Twitch’s previous constraints are a case in point. Now that the technology is more widely available (through third-party VPUs and internal projects), we’re seeing industry-wide adoption. Those who invest in the right hardware can deliver a better user experience (crisper streams, more reliable quality) at lower cost. Moreover, as data center power becomes scarce and expensive (with AI training soaking up much of the available power budget), power-efficient video chips are necessary to keep live streaming viable.
Strategic and Business Implications
Across both AI inference and video encoding, the move to specialized silicon is driven by a mix of performance demands, cost efficiency, and strategic control – and 2026 is when this shift will become obvious. A few closing points stand out:
- Performance per Watt = Competitive Advantage: Whether it’s an LLM responding in milliseconds or a live video transcoder handling 100 streams, specialized chips offer leaps in throughput and latency for the same power budget. This not only reduces operating costs (important as energy prices rise and data center space is constrained) but also enables better products. In AI, this means interactive applications that weren’t possible before (real-time assistants with long context). In video, it means higher streaming quality (or new features like AI-enhanced video) without exorbitant cost. Companies that capture these gains can offer services, such as ultra-low-latency AI APIs or 4K live video, that competitors sticking with general-purpose CPUs and GPUs realistically cannot.
- Workload Segmentation and “Routing”: A key lesson is the importance of explicitly labeling and segregating workloads to run on optimal hardware. NVIDIA’s CEO Jensen Huang described inference splitting into two classes of work; similarly, video encoding can be split into high-volume batch vs. real-time interactive vs. edge device tasks. This dynamic turns hardware selection into a routing decision rather than a one-time purchase decision. Cloud providers like Akamai are already doing this internally – allocating GPUs here, ASICs there – and offering those choices to customers. For enterprises and developers, the implication is to think modularly: optimize each part of your pipeline (AI or video) with the hardware that best meets that part’s needs (be it throughput, latency, cost, or memory).
- Independence and Ecosystem Control: Both Nvidia’s Groq move and the hyperscalers’ video ASIC projects underscore a strategic motive: reducing dependence on a single vendor or technology stack. NVIDIA integrated Groq’s tech in part to keep customers from straying to Google TPUs. Google built Argos because buying 10 million Intel CPUs (as one analysis estimated for YouTube’s scale) was not practical. By owning key silicon, companies can innovate faster and roll out new AI model features or video codecs without waiting. There’s a business model angle too: Google and Meta don’t sell Argos or MSVP – these are competitive advantages kept in-house. Meanwhile, NVIDIA’s licensing of Groq indicates an interesting model where the incumbent acquires the innovation of a startup rather than ceding market share. We may see more such deals or partnerships as specialized chip startups proliferate. For smaller players or those who can’t design chips, the emergence of third-party solutions (ASIC cards from NETINT, Graphcore IPUs for AI, etc.) means they can still ride the specialization wave by purchasing the tech. In fact, the ASIC-as-a-service ecosystem is growing – cloud providers might even offer specialized encoding or inference acceleration as a managed service, turning those efficiency gains into a product.
- Industry Recognition and Next Steps: The fact that a Technical Emmy was awarded in 2024 for hardware video encoding for the cloud is emblematic – efficient streaming is not just a tech problem, but a media industry priority. Similarly, AI inference at scale is now a boardroom topic for any enterprise leveraging AI. Both domains are grappling with the end of “infinite” compute assumptions – budgets and physical limits are real, so efficiency is king. Ultimately, the trajectory is clear: computing infrastructure in 2026 and beyond will be composed of multiple specialized processors working in concert, each tackling what it’s best at, from GPU-like matrix crunching to memory-centric streaming tasks. The companies and teams that internalize this – designing their systems with a “best tool for each job” mindset – will have a structural advantage in both performance and cost.
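The “routing” framing from the points above can be sketched as a simple dispatch table. The categories and hardware labels below mirror the examples in this case study; the mapping is of course illustrative, and a real scheduler would also weigh cost, queue depth, and availability.

```python
# Illustrative workload router mapping workload traits to hardware classes.
def route(workload):
    kind, realtime = workload["kind"], workload["realtime"]
    if kind == "inference":
        if workload.get("phase") == "prefill":
            return "compute-optimized accelerator (GPU-class)"
        # Small models with tight latency budgets suit SRAM-centric chips.
        if workload.get("params_b", 0) <= 8 and realtime:
            return "SRAM-centric inference chip"
        return "HBM GPU"
    if kind == "video":
        return "VPU / transcode ASIC" if realtime else "batch transcode ASIC farm"
    return "general-purpose CPU"

print(route({"kind": "inference", "realtime": True, "params_b": 7, "phase": "decode"}))
print(route({"kind": "video", "realtime": False}))
```

The point is not this particular table but the habit it encodes: hardware selection becomes a per-workload routing decision rather than a one-time purchase.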
The evolution of AI inference and video encoding follows a common pattern: as workloads mature and scale up, special-purpose silicon delivers unbeatable efficiency and capabilities. Just as the AI world is splitting the GPU’s workload into new pieces (prefill vs. decode, large vs. small models) to meet latency and context demands, the video world is splitting encoding tasks out of general CPUs to meet quality and throughput demands. Both herald the end of the general-purpose era for these workloads. The age of specialized co-processors is here – and those who embrace it will reap the rewards in 2026.
Schedule a meeting to discuss where hardware video encoding delivers the highest performance per watt at scale.