As a latecomer to the switch market, Arista has carved out a sizable share of the fiercely competitive data-center and financial networking segments on the strength of low-latency, ahead-of-the-curve hardware and a solid switch operating system. A company like that surely has plenty worth learning from.

American companies, unlike many domestic ones, tend to be extremely self-confident: they write their technical documentation and whitepapers in remarkable detail, unafraid of being copied, because by the time you finish copying they have already moved on to the next innovation and left you far behind.

Without further ado: I spent an evening reading the whitepaper ARISTA_7250X_7300_MultiChip_Switch_Architecture. Several of the techniques it describes are worth borrowing, whether for switch hardware design, for networking, or even for software design, so I am writing them down here for my own later study and use.

Switch fabric dynamic load balancing

The interconnection between port ASICs on the 7250X and 7300 series platforms is built around a packet based fabric. In such a fabric, a packet is sent over a single fabric link in its entirety, avoiding the need to fragment and reassemble. This approach reduces latency to a level similar to that seen in single system top of rack switches, while still providing modular system port density.

Many traditional internal CLOS switches and packet-based fabrics rely on a set of hashing algorithms to evenly distribute frames crossing the fabric. While the efficiency of such algorithms varies widely, no available algorithm claims to support 100% efficiency, which has typically resulted in switching fabric performance significantly less than the theoretical capacity.
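The imbalance the whitepaper alludes to is easy to reproduce. Below is a minimal sketch (my own illustration, not anything from the whitepaper): flows are pinned to fabric links purely by a hash of their 5-tuple, so per-link load depends on how flows and their rates happen to collide, not on actual utilization. The flow rates and tuple strings are made up.

```python
import random
import zlib

NUM_LINKS = 4
random.seed(42)

# Simulate 32 flows with random rates (Gb/s); each flow is hashed to
# one fabric link using a CRC of its 5-tuple, as a generic static
# hash-based fabric might do.
flows = [(f"10.0.0.{i}:5000->10.0.1.{i}:80", random.uniform(0.1, 5.0))
         for i in range(32)]

load = [0.0] * NUM_LINKS
for five_tuple, rate in flows:
    link = zlib.crc32(five_tuple.encode()) % NUM_LINKS
    load[link] += rate

for i, l in enumerate(load):
    print(f"link {i}: {l:.1f} Gb/s")
print(f"spread: min {min(load):.1f}, max {max(load):.1f} Gb/s")
```

Running this typically shows a noticeable spread between the busiest and quietest link, even though the hash itself is uniform over flow identities, because it is blind to flow rates.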

This limitation inherent in previous CLOS based systems has been eliminated with the Arista 7250X and 7300 series switches, which leverage Dynamic Load Balancing Fabric (DLBF) technology. DLBF actively monitors internal port-fabric links and both allocates new flows and rebalances existing flows to fabric links with the lowest utilization, providing an efficient and well-balanced distribution of load over all links at all times, without the need for dividing packets into smaller segments, adding additional internal headers, or using fabric over-speed techniques to compensate for the hashing algorithm's inherent inefficiency.

At the packet processor level each fabric link is tracked and monitored, both in terms of link utilization and queue depth; this data is quantized to identify the optimum interface at a particular point in time. Each new flow received by the packet processor is mapped to the current optimum interface based on the computed hash result. The rate of a particular traffic flow generally does not remain consistent, especially in the case of TCP communication. DLBF provides a mechanism to periodically re-balance flows, ensuring the distribution remains consistently optimized. Rebalancing is achieved using an inactivity timer: if the time between receiving two packets in the same flow is greater than the inactivity timer, then the flow is rebalanced over the current optimum link. Using the inactivity timer allows effective optimization of the fabric without risking out of order packets.

Overall, since some of these devices are chassis systems, and Arista likes to tout low latency as a selling point, allocating and scheduling switch-fabric bandwidth sensibly is a critical factor. Arista's approach boils down to:

  • A monitor-and-optimize mechanism: actively monitor bandwidth utilization on the internal links between the ports and the switch fabric, and use that data as the reference for load-balancing traffic, avoiding the congestion that traditional static allocation can cause.
  • A rebalancing mechanism: periodically check each flow for inactivity and, once a flow has gone idle, re-optimize its assignment to internal links. Because the operation is per-flow, it effectively avoids packet reordering within TCP sessions.
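The two mechanisms above can be sketched together in a few lines. This is my own hedged reconstruction of the DLBF idea, not Arista's actual data structures: each link carries a utilization estimate, an active flow stays pinned to its link (so it cannot reorder), and a flow is moved to the currently least-loaded link only after it has been idle longer than the inactivity timer.

```python
class FlowletBalancer:
    """Sketch of DLBF-style per-flow dynamic load balancing.

    Class name, timer value, and utilization accounting are all
    illustrative assumptions, not the whitepaper's implementation.
    """

    def __init__(self, num_links, inactivity_timeout=0.0005):
        self.util = [0.0] * num_links      # quantized per-link load
        self.timeout = inactivity_timeout  # inactivity-timer threshold (s)
        self.flows = {}                    # flow_id -> (link, last_seen)

    def pick_link(self, flow_id, now, pkt_bytes):
        entry = self.flows.get(flow_id)
        if entry is not None and now - entry[1] < self.timeout:
            link = entry[0]  # flow still active: keep its link, no reordering
        else:
            # new flow, or idle longer than the timer: rebalance it
            # onto the current optimum (least-utilized) link
            link = min(range(len(self.util)), key=lambda i: self.util[i])
        self.flows[flow_id] = (link, now)
        self.util[link] += pkt_bytes
        return link

    def decay(self, factor=0.5):
        # periodic decay stands in for real-time utilization sampling
        self.util = [u * factor for u in self.util]
```

The key property is visible in the `pick_link` branch: back-to-back packets of the same flow always take the same link, and only an idle gap longer than the timer permits a move, which is exactly why the scheme avoids out-of-order delivery.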

Looking at today's SDN landscape, the same idea maps onto programming SDN switches from a controller: the controller periodically samples how flows are distributed across the network and how link bandwidth is being used, then reprograms forwarding paths to optimize them, which should markedly improve bandwidth efficiency. In other words, what Arista is doing here is essentially an SDN concept.

Separately, VMware's LBT (Load Balance Teaming) technology, mentioned in the link below, can dynamically balance load across NICs; but judging from the documentation it still schedules per packet rather than per flow, which is clearly not as good as the approach here. No doubt VMware has the capability to optimize that away in time.

From a systems-design standpoint, incremental optimization is a sound strategy, perhaps even a universal principle: the odds of getting something completely right in one shot are small, and most things need repeated polishing and tuning.

UFT: dynamic table allocation (presumably BCM's SmartTable technology)

Forwarding table flexibility on the 7250X and 7300X Series linecards is delivered through the Unified Forwarding Table (UFT). Each L2 and L3 forwarding element has a dedicated table and can additionally have the table sizes augmented by allocating a portion of the UFT. The UFT contains 256K entries from 4 banks, where each bank can be allocated to forwarding tables. Much wider deployment flexibility is achieved by being able to dedicate the entire UFT to expand the MAC address tables in dense L2 environments, or a balanced approach achieved by dividing the UFT between MAC Address and Host route scale. The UFT can also be leveraged to support the expansion of the longest prefix match (LPM) tables – (future).

Fixed allocation of every table imposes tighter constraints on users of a hardware device. Handing users some choice, rather than unlimited choice, lets them feel the flexibility to cover all sorts of scenarios while keeping the vendor out of the trap of endlessly building one-off support for customers.
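The bank-granting scheme the whitepaper describes is simple to model. The sketch below is hypothetical: the dedicated table sizes and profile names are made-up illustrative numbers, not the actual 7250X/7300X modes; only the 256K-entry/4-bank shape comes from the text.

```python
UFT_BANKS = 4
BANK_SIZE = 256 * 1024 // UFT_BANKS  # 64K entries per bank

# Dedicated (always-present) table sizes -- illustrative values only
DEDICATED = {"mac": 32 * 1024, "host": 16 * 1024, "lpm": 16 * 1024}

def apply_profile(bank_owners):
    """Grant each UFT bank wholesale to one forwarding table.

    bank_owners: one table name per bank, e.g. four times 'mac' for a
    dense-L2 profile, or a 'mac'/'host' split for a balanced one.
    Returns the effective size of each table under that profile.
    """
    assert len(bank_owners) == UFT_BANKS
    sizes = dict(DEDICATED)
    for owner in bank_owners:
        sizes[owner] += BANK_SIZE
    return sizes

# Dense L2: dedicate the entire UFT to the MAC address table
print(apply_profile(["mac"] * 4))
# Balanced: divide the UFT between MAC addresses and host routes
print(apply_profile(["mac", "mac", "host", "host"]))
```

The point of the model is the trade-off in the previous paragraph: the user picks among a few coarse bank-level profiles rather than arbitrary per-entry carving, which is enough flexibility for most deployments without an explosion of configurations the vendor must support.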

SSD storage / x86 CPU control plane

An optional built-in SSD enables advanced capabilities for example long term logging, data captures and other services that are run directly on the switch.

Arista EOS®, the control-plane software for all Arista switches, executes on multi-core x86 CPUs with multiple gigabytes of DRAM. As EOS is multi-threaded, runs on a Linux kernel and is extensible, the large RAM and fast multi-core CPUs provide for operating an efficient control plane with headroom for running 3rd party software, either within the same Linux instance as EOS or within a guest virtual machine.

Together these two choices turn the switch's control plane into a standard Linux platform that is easy to extend and customize, with access to the rich tool set of the Linux ecosystem. In a world where most switches still run embedded busybox systems, this innovation clearly opens a new path toward SDN and higher levels of integration.

Packet forwarding ASICs

The whitepaper as a whole mainly walks through the forwarding flow on pizza-box and chassis systems. Setting the switch ASIC aside, the port ASIC, i.e. the forwarding chip on each board, is presumably BCM's Trident II, so the document effectively amounts to a detailed description of Trident II's forwarding logic. One wonders whether BCM is bothered by Arista describing its silicon in this much detail.

Appendix:

ARISTA_7250X_7300_MultiChip_Switch_Architecture.pdf

VMWARE LBT

