
Ruijie Networks High-Performance Network Solution: Opening the "Ren and Du Meridians" of AIGC

Published: 2023-03-20

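Introduction

AIGC (AI-Generated Content) has advanced rapidly of late, and its iteration speed is growing exponentially. The launches of GPT-4 and ERNIE Bot (文心一言) in particular have drawn intense attention to its commercial value and application scenarios. As AIGC develops, the parameter count of trained models has grown from hundreds of billions to trillions, and the underlying GPU clusters have reached the scale of ten thousand cards. Networks keep growing as a result, and communication between network nodes faces ever greater challenges. Against this background, raising the compute capability of AI servers and the communication capability of the network while keeping cost in check has become one of the important research directions in artificial intelligence.

In response to the relationship among AIGC compute, GPU utilization, and the network, and to the challenges facing mainstream HPC networking, Ruijie Networks has launched its "Zhisu" (智速) DDC (Distributed Disaggregated Chassis) high-performance network solution, which it describes as industry-leading, to open the "Ren and Du meridians" for AIGC workloads and help compute power advance rapidly.

Figure: Connection diagram of Ruijie Networks DDC products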

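The relationship among AIGC compute, GPU utilization, and the network

ChatGPT training time versus GPU utilization

Take ChatGPT as an example. In terms of compute, training ran on Microsoft's Azure AI supercomputing infrastructure (a high-bandwidth cluster of 10,000 V100 GPUs) and consumed roughly 3,640 PF-days in total (that is, 10^15 operations per second sustained for 3,640 days). The following conversion estimates how long 10,000 V100 GPUs would need for this training:

Table: ChatGPT compute and training time

Note: The ChatGPT compute requirement was obtained from online sources and is for reference only. OpenAI assumes a utilization of 33% in its article "AI and Compute". A group of researchers from NVIDIA, Stanford, and Microsoft achieved utilization of 44% to 52% when training large language models on distributed systems.

Figure: ChatGPT's answer about its training time

ChatGPT's reply is broadly consistent with the time computed in the table above, which suggests a utilization of around 50%.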
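The conversion described above can be sketched as follows. This is a minimal illustration, not the table from the original article: the per-GPU peak of 125 TFLOPS (the V100 mixed-precision tensor peak) is an assumption, since the article does not say which peak figure it used, and the function name `training_days` is hypothetical.

```python
def training_days(total_pf_days, num_gpus, peak_tflops_per_gpu, utilization):
    """Days needed to deliver total_pf_days of compute on the cluster."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Sustained cluster throughput in PFLOPS (1 PFLOPS = 1000 TFLOPS).
    sustained_pflops = num_gpus * peak_tflops_per_gpu / 1000 * utilization
    return total_pf_days / sustained_pflops


# ASSUMPTION: 125 TFLOPS per V100; the article does not state the peak it used.
V100_PEAK_TFLOPS = 125.0

# Utilization values quoted in the note: 33% (OpenAI) and 44% to 52%.
for util in (0.33, 0.44, 0.52):
    days = training_days(3640, 10_000, V100_PEAK_TFLOPS, util)
    print(f"utilization {util:.0%}: {days:.1f} days")
```

Because training time is inversely proportional to utilization, every point of utilization lost to the network translates directly into longer training.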

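It follows that the main factors determining how long a model takes to train are GPU utilization and the processing capacity of the GPU cluster, and both of these key metrics are closely tied to network efficiency. In an AI cluster, GPUs are usually the core resource of the compute nodes because they handle large-scale deep learning tasks efficiently. GPU utilization, however, depends on several factors, and network efficiency is a key one.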

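Network efficiency versus GPU utilization

The network plays a critical role in AI training. An AI cluster usually consists of many compute nodes and storage nodes that must communicate and exchange data frequently. If the network is inefficient, communication among these nodes slows down, which directly reduces the compute the cluster delivers.

An inefficient network can cause the following problems, each of which lowers GPU utilization:

Longer data transfer time: in an inefficient network, data transfers take longer. When a GPU has to wait for a transfer to finish before it can compute, its utilization drops;

Network bandwidth bottleneck: in an AI cluster, GPUs usually need to exchange data frequently with other compute nodes. If network bandwidth is insufficient, a GPU cannot get enough data to compute on, and its utilization drops;

Unbalanced task scheduling: in an inefficient network, tasks may be assigned to compute nodes other than the one hosting the GPU. When large amounts of data must then be transferred, the GPU may sit idle waiting, which lowers its utilization.

Improving GPU utilization therefore requires improving network efficiency, for example by adopting faster network technology, optimizing the network topology, and provisioning bandwidth sensibly. In model training, the parallelism strategy of distributed training (data parallelism, tensor parallelism, and pipeline parallelism) determines the communication pattern among the data processed by the GPUs. The efficiency of this communication is affected by the following factors: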

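Figure: Factors that affect communication

Of these, bandwidth and device forwarding latency are limited by hardware; end-host processing latency depends on the choice of technology (TCP or RDMA), with RDMA being lower; queuing and retransmission depend on network optimization and technology choice.

According to the quantitative model [1], GPU utilization = iteration compute time on the GPU / (iteration compute time on the GPU + total network communication time), which yields the following conclusions:

Figure: GPU utilization versus bandwidth throughput (left); GPU utilization versus dynamic latency (right)

Network bandwidth throughput and dynamic latency (congestion and packet loss) both have a clear effect on GPU utilization.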
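To make the utilization formula concrete, here is a small sketch that evaluates it for a few link speeds and dynamic delays. The workload numbers (0.5 s of compute and 5 GB exchanged per iteration) and the simple "transfer time plus dynamic delay" communication model are illustrative assumptions, not values from the article or from [1].

```python
def comm_time_s(bytes_per_iter, bandwidth_gbps, dynamic_delay_s=0.0):
    """Per-iteration communication time: transfer time plus dynamic delay."""
    transfer_s = bytes_per_iter * 8 / (bandwidth_gbps * 1e9)
    return transfer_s + dynamic_delay_s


def gpu_utilization(compute_s, comm_s):
    """Compute time / (compute time + total communication time)."""
    return compute_s / (compute_s + comm_s)


COMPUTE_S = 0.5        # hypothetical compute time per iteration
BYTES_PER_ITER = 5e9   # hypothetical data exchanged per iteration

for bw in (100, 200, 400):
    for delay in (0.0, 0.05):
        u = gpu_utilization(COMPUTE_S, comm_time_s(BYTES_PER_ITER, bw, delay))
        print(f"{bw}G link, dynamic delay {delay * 1000:.0f} ms: {u:.1%}")
```

In this model, utilization rises with bandwidth and falls as congestion or loss adds delay, which is the qualitative behavior the two curves describe.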

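Looking at how total communication latency is composed:

Figure: Composition of total communication latency

Static latency has a comparatively small effect, so the focus should be on reducing dynamic latency. Doing so effectively raises GPU utilization and thereby the compute delivered.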

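Challenges facing mainstream HPC networking

InfiniBand networking is expensive and closed

InfiniBand currently delivers the best results among high-performance networks. It uses very high bandwidth and a credit-based mechanism to ensure congestion-free operation and ultra-low latency, but it is also the most expensive option, costing several times as much as a traditional Ethernet network of the same bandwidth. InfiniBand is also a closed technology: the industry currently has only one mature supplier, so end users cannot secure a second source.

Most users in the industry therefore choose a traditional Ethernet networking solution.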

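PFC and ECN can trigger rate reduction

The mainstream approach to high-performance networking today is to build an RDMA-capable network on RoCE v2. Two important companion technologies are PFC and ECN, both of which were created to avoid congestion on links.

In a multi-tier PFC network, congestion at a switch ingress is back-pressured hop by hop to the source server, which pauses sending; this relieves congestion and avoids packet loss. In a multi-tier network, however, this approach carries the risk of PFC deadlock, which stops RDMA traffic from being forwarded.

Figure: How PFC works

ECN, by contrast, relies on the destination detecting congestion at the switch egress: a RoCEv2 CNP packet is generated directly to tell the source to slow down. On receiving the CNP, the source server lowers the sending rate of exactly the affected QP, relieving congestion while avoiding indiscriminate rate reduction.

Figure: ECN marking bits

Neither technology is flawed in itself; both exist to resolve congestion. Once deployed, however, they may be triggered frequently by congestion in the network, causing sources to pause or slow down. Communication bandwidth then drops, GPU utilization suffers noticeably, and the compute delivered by the whole high-performance network is dragged down.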
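The per-QP rate reduction described above can be illustrated with a simplified sketch. The article does not name the rate-control algorithm, so the update rules below are an assumption loosely modeled on the published DCQCN scheme (multiplicative decrease on each CNP, then recovery toward the previous rate); the class name `QueuePairRate` and the gain value are hypothetical.

```python
class QueuePairRate:
    """Sending rate of one QP, reacting to CNPs (simplified, DCQCN-style)."""

    def __init__(self, line_rate_gbps, g=1 / 256):
        self.current_gbps = line_rate_gbps
        self.target_gbps = line_rate_gbps
        self.alpha = 1.0   # congestion estimate in [0, 1]
        self.g = g         # ASSUMPTION: gain used to update alpha

    def on_cnp(self):
        # Remember the rate before the cut, then decrease multiplicatively.
        self.target_gbps = self.current_gbps
        self.current_gbps *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self):
        # No CNP seen for a while: decay alpha and recover toward the target.
        self.alpha *= 1 - self.g
        self.current_gbps = (self.current_gbps + self.target_gbps) / 2


congested_qp = QueuePairRate(200.0)
other_qp = QueuePairRate(200.0)
congested_qp.on_cnp()
print(congested_qp.current_gbps, other_qp.current_gbps)
```

Only the QP that received the CNP slows down, which matches the "no indiscriminate rate reduction" point; the cost is that frequent CNPs keep the affected QP well below line rate.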

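ECMP imbalance can cause congestion

AI training involves two main communication patterns, All-Reduce and All-to-All, both of which require frequent communication from one GPU to many other GPUs.

Figure: All-to-All pattern (left); All-Reduce pattern (right)

In a traditional network, ToR and Leaf devices use routing plus ECMP. ECMP selects paths by hashing on flows, so in an extreme case one ECMP link is saturated by a single elephant flow while the other ECMP links stay relatively idle, producing an unbalanced load.

Figure: Traditional ECMP deployment

In an internal test environment simulating 8 ECMP links, the results were as follows:

Figure: ECMP traffic test results

Flow-based ECMP leaves some links clearly busy (ECMP1-5 and ECMP1-6) and others idle (ECMP1-0 through ECMP1-3 are relatively idle). Under the All-Reduce and All-to-All patterns, a path can therefore easily become congested because of unbalanced ECMP load. Once congestion causes retransmission, total communication latency rises and GPU utilization falls.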
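The imbalance can be reproduced in miniature with the sketch below, which hashes a handful of large flows onto 8 links. It is an illustration only: real switches use vendor-specific hash functions, and the CRC32 hash, the addresses, and the flow sizes here are assumptions. UDP port 4791 is the standard RoCEv2 destination port.

```python
import zlib


def ecmp_link(five_tuple, num_links):
    """Pick a link by hashing the flow; all packets of a flow share one link."""
    key = "|".join(map(str, five_tuple)).encode()
    return zlib.crc32(key) % num_links


def link_loads(flows, num_links):
    """Sum the bytes that each link carries for a list of (five_tuple, size)."""
    loads = [0] * num_links
    for five_tuple, size in flows:
        loads[ecmp_link(five_tuple, num_links)] += size
    return loads


# Hypothetical traffic: 16 elephant flows of 10 GB each between GPU servers.
flows = [
    ((f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "UDP"), 10)
    for i in range(16)
]
loads = link_loads(flows, 8)
print("GB per link:", loads, "max:", max(loads), "min:", min(loads))
```

With so few, large flows, nothing forces the hash to spread them evenly, so some links typically carry several flows while others carry none.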

ËùÒÔ£¬ÎªÁ˽â¾ö´ËÀàÎÊÌ⣬Ñо¿½çÌá³öÁËphost¡¢Homa¡¢NDP¡¢1RMA ºÍ AeolusµÈ·á¸»µÄ½â¾ö¹«º£²Ê´¬¡¤6600¹ÙÍø£¬ËüÃÇÔÚ²»Í¬³Ì¶ÈÉϽâ¾öÁË incast£¬ »¹½â¾öÁ˸ºÔØÆ½ºâºÍµÍÑÓ³ÙÇëÇó/ÏìÓ¦Á÷Á¿µÄÎÊÌâ¡£µ«ÊÇÒ²´øÀ´ÁËеÄÌôÕ½£¬ÍùÍùÕâЩÑо¿µÄ¹«º£²Ê´¬¡¤6600¹ÙÍø¶¼ÊÇÐèÒª¶Ëµ½¶ËÀ´½â¾öÎÊÌ⣬¶ÔÖ÷»ú¡¢Íø¿¨¡¢ÍøÂçµÄ¸Ä¶¯½Ï´ó£¬¶ÔÓÚÒ»°ãÓû§¶øÑÔ£¬³É±¾½Ï¸ß¡£

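Challenges of building AI clusters with chassis switches

Some overseas internet companies hope to use chassis switches built on DNX chips with VOQ support to solve the low bandwidth utilization caused by load imbalance, but this approach faces the following challenges.

Limited scalability: the chassis size caps the maximum port count. Building a larger cluster requires scaling out across multiple chassis, which reintroduces multi-tier PFC and ECMP links, so a chassis suits only small-scale deployments;

High power consumption: a chassis contains many line-card chips, fabric chips, and fans, so a single device draws a great deal of power, easily more than 20,000 W and in some cases over 30,000 W, which places high demands on rack power;

Many ports per device, and therefore a large failure domain.

For these reasons, chassis devices are suited only to small-scale AI compute clusters.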

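A new form factor: DDC products to support high-performance AIGC networks

DDC is a solution that disaggregates a chassis into distributed devices. It uses almost the same chips and key technologies as a traditional chassis switch, but the DDC architecture is simple, supports elastic scaling and rapid feature iteration, is easier to deploy, and draws less power per device.

As shown in the figure below, the service line cards become the front end in the NCP role, and the switch fabric cards become the back end in the NCF role. The connector components that used to join the two are replaced by optical fiber cables, and the management engine of the original chassis becomes the NCC, a centralized or distributed management component in the DDC architecture.

Figure: DDC product connection diagram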

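DDC supports very large-scale deployment

The advantage of the DDC architecture over the chassis architecture is elastic scalability: the network size can be chosen flexibly to match the size of the AI cluster.

In a single-POD network, 96 NCPs serve as the access layer. Each NCP has 36 downlink 200G interfaces that connect to the NICs of the AI compute cluster, and 40 uplink 200G interfaces that can connect to up to 40 NCFs. Each NCF provides 96 200G interfaces. At this scale the uplink-to-downlink bandwidth ratio is 1.1:1, that is, the uplinks are over-provisioned. The whole POD supports 3,456 200G network interfaces; at 8 GPUs per server, that is 432 AI compute servers.

Figure: Single-POD network architecture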
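The single-POD figures quoted above can be checked with a few lines of arithmetic. The function name and dictionary keys are hypothetical, and the server count assumes one 200G interface per GPU, which the article implies but does not state.

```python
def single_pod_scale(ncp_count=96, ncp_downlinks=36, ncp_uplinks=40,
                     ncf_ports=96, gpus_per_server=8):
    """Derive the single-POD port and server counts from the device specs."""
    # Every NCP connects one uplink to each NCF, so an NCF needs one port per NCP.
    if ncp_count > ncf_ports:
        raise ValueError("an NCF has too few ports to reach every NCP")
    access_ports = ncp_count * ncp_downlinks
    return {
        "access_ports": access_ports,
        # ASSUMPTION: one 200G interface per GPU.
        "servers": access_ports // gpus_per_server,
        "uplink_to_downlink": ncp_uplinks / ncp_downlinks,
        "ncf_count": ncp_uplinks,
    }


pod = single_pod_scale()
print(pod["access_ports"], pod["servers"], round(pod["uplink_to_downlink"], 2))
```

The 96-port NCF is what caps the POD at 96 NCPs, and the 40:36 uplink-to-downlink split is where the 1.1:1 ratio comes from.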

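In a multi-tier POD network, capacity can be built out on demand, POD by POD. Because the NCFs inside a POD must give up half of their SerDes to connect to the second-tier NCFs, each POD in this scenario uses 48 NCPs as the access layer, each with 36 downlink 200G interfaces, so a single POD supports 1,728 200G interfaces. Scale grows by adding PODs horizontally, and the whole network supports a maximum of more than 10,368 200G network ports.

Each NCP's 40 uplink 200G interfaces connect to the 40 NCFs in its POD. Each in-POD NCF uses 48 200G interfaces as downlinks and splits its other 48 200G interfaces into groups of 16 as uplinks to the second-tier NCFs. The second tier is designed as 40 planes of 3 NCFs each, with the planes corresponding to the 40 NCFs in a POD.

The network thus achieves a 1.1:1 over-provisioned ratio inside each POD and a 1:1 convergence ratio between the PODs and the second-tier NCFs.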
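The multi-tier numbers can be cross-checked in the same way. The POD count is not stated in the article; the sketch infers it by assuming each 96-port second-tier NCF receives one group of 16 links from each POD, which gives 6 PODs and reproduces the quoted 10,368 ports. The function and key names are hypothetical.

```python
def multi_pod_scale(ncp_per_pod=48, ncp_downlinks=36, ncf_ports=96,
                    uplink_group_size=16, planes=40):
    """Derive multi-tier POD capacity from the figures in the text."""
    ncf_uplinks = ncf_ports // 2                      # half the SerDes face tier 2
    ncfs_per_plane = ncf_uplinks // uplink_group_size
    # INFERENCE: a tier-2 NCF takes one group of links from each POD.
    max_pods = ncf_ports // uplink_group_size
    ports_per_pod = ncp_per_pod * ncp_downlinks
    return {
        "ports_per_pod": ports_per_pod,
        "ncfs_per_plane": ncfs_per_plane,
        "tier2_ncfs": planes * ncfs_per_plane,
        "max_pods": max_pods,
        "total_ports": max_pods * ports_per_pod,
    }


scale = multi_pod_scale()
print(scale)
```

The 48 downlinks and 48 uplinks on each in-POD NCF are what give the 1:1 ratio between the PODs and the second tier.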

The 200G network ports are compatible with 100G NICs; in special cases, 1-to-2 or 1-to-4 breakout cables can be used to connect 25G/50G NICs.

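More balanced load and lower packet loss with the VOQ + Cell mechanism

Dynamic load balancing based on forwarding fragmented cells keeps latency stable and reduces the difference in peak bandwidth across links.

The forwarding process is shown in the figure:

First, the sending side receives packets from the network and classifies them into VOQs for storage. Before sending the packets, it first sends a Credit message to determine whether the receiving side has enough buffer space to handle them;

If so, the packets are fragmented into cells and dynamically load-balanced across the intermediate fabric nodes. The cells are reassembled and stored at the receiving side and then forwarded into the network.

Cells are produced by slicing packets; a cell is typically 64 to 256 bytes.

Each cell is forwarded according to a lookup of the cell destination in the reachability table and is sent in round-robin fashion. Compared with ECMP, which hashes each flow and picks a single path, the cells make full use of every uplink, and the amount of data carried by all uplinks is approximately equal.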
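The effect of round-robin cell spraying can be sketched as follows. This is a simplified model rather than the chip's actual behavior: it ignores the reachability-table lookup and the 64-byte minimum cell size, and the 256-byte maximum cell, the packet mix, and the function names are illustrative assumptions.

```python
from itertools import cycle


def slice_into_cells(packet_len, max_cell=256):
    """Split a packet into cell sizes of at most max_cell bytes."""
    full, rest = divmod(packet_len, max_cell)
    return [max_cell] * full + ([rest] if rest else [])


def spray(packet_lengths, num_uplinks, max_cell=256):
    """Send cells over the uplinks in round-robin order; return bytes per link."""
    loads = [0] * num_uplinks
    uplinks = cycle(range(num_uplinks))
    for length in packet_lengths:
        for cell in slice_into_cells(length, max_cell):
            loads[next(uplinks)] += cell
    return loads


# Hypothetical mix: 100 packets of 4096 bytes and 50 packets of 1500 bytes.
packets = [4096] * 100 + [1500] * 50
loads = spray(packets, 8)
spread = (max(loads) - min(loads)) / (sum(loads) / len(loads))
print(loads, f"relative spread: {spread:.2%}")
```

Because the unit of balancing is a small cell rather than a whole flow, even a single elephant flow is spread across all uplinks.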

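If the receiving side is temporarily unable to process packets, they are held in the VOQ at the sending side rather than being forwarded to the receiver and dropped. Each DNX chip provides on-chip OCB buffering plus 8 GB of off-chip HBM, which for a 200G port is equivalent to buffering roughly 150 ms of data. Packets are sent only when the peer's Credit message explicitly indicates that it can accept them. With this mechanism, making full use of the buffers greatly reduces packet loss and can even eliminate it. With fewer retransmissions, overall communication latency is lower and more stable, which raises bandwidth utilization and in turn service throughput.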
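The credit handshake described above can be modeled in a few lines. This is a toy sketch under stated assumptions: one VOQ, one egress buffer, and credits granted per packet in bytes. The class names `Egress` and `IngressVOQ` are hypothetical and do not correspond to any real device API.

```python
from collections import deque


class Egress:
    """Receiving side: grants credit only while buffer space is free."""

    def __init__(self, buffer_bytes):
        self.capacity = buffer_bytes
        self.free = buffer_bytes

    def grant(self, nbytes):
        """Reserve space and return True if nbytes fit, else return False."""
        if nbytes <= self.free:
            self.free -= nbytes
            return True
        return False

    def drain(self, nbytes):
        """Forward nbytes to the network, freeing buffer space."""
        self.free = min(self.capacity, self.free + nbytes)


class IngressVOQ:
    """Sending side: holds packets until the egress grants credit."""

    def __init__(self):
        self.queue = deque()

    def enqueue(self, nbytes):
        self.queue.append(nbytes)

    def schedule(self, egress):
        """Send queued packets while credit is granted; return bytes sent."""
        sent = 0
        while self.queue and egress.grant(self.queue[0]):
            sent += self.queue.popleft()
        return sent


voq, egress = IngressVOQ(), Egress(buffer_bytes=3000)
for _ in range(4):
    voq.enqueue(1500)
first = voq.schedule(egress)    # the rest waits in the VOQ, nothing is dropped
egress.drain(first)
second = voq.schedule(egress)
print(first, second, len(voq.queue))
```

Packets that do not fit stay queued at the sender instead of being dropped at the receiver, which is why the mechanism trades buffering for retransmissions.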

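No deadlock with single-hop PFC deployment

In DDC's logic, all NCPs and NCFs can be viewed as a single device. Once an RDMA domain is deployed on this network, PFC exists at only one level, on the server-facing interfaces, so the multi-tier PFC back-pressure and deadlock seen in traditional networks do not occur. In addition, given DDC's forwarding mechanism, ECN can be deployed on the interfaces: if the internal Credit and buffering mechanisms cannot absorb a traffic burst, a CNP can be sent to the server to request a lower rate. (Under AI communication patterns, All-to-All and All-Reduce combined with cell slicing usually balance traffic as evenly as possible, and it is rare for a single port to be saturated, so in most cases ECN need not be configured.)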

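No NCC: a distributed OS for higher reliability

On the management and control plane, to remove the impact of management-network failures and of the NCC as a single point of failure, we eliminated the NCC's centralized control plane and built a distributed OS. An SDN operations controller configures and manages the devices through standard interfaces (Netconf, gRPC, and so on), and each NCP and NCF is managed independently with its own control plane and management plane.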

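Test comparison results

In theory, the DDC solution offers many advantages, such as elastic scaling, rapid feature iteration, easier deployment, and low per-device power. In practice, traditional networking also has advantages that come with technical maturity, such as a wider choice of brands and product lines on the market and support for larger clusters. When a project requires choosing between the higher-performance DDC and traditional networking with its larger deployment scale, customers can refer to the following comparison and test results:

Figure: Comparison of traditional networking and DDC test results

We also used the OpenMPI test suite to run a simulated comparison between a chassis device and traditional networking devices (a chassis device works on the same principle as DDC, so a chassis was used for this test). The conclusion is that in the All-to-All scenario, the chassis device improves bandwidth utilization by about 20% over traditional networking (corresponding to an improvement of about 8% in GPU utilization).

Figure: Simulated comparison test of a chassis device and traditional networking devices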

Ruijie equipment overview

Based on a deep understanding of customer needs, Ruijie Networks has already released two deliverable products: a 200G NCP switch and a 200G NCF switch.

NCP: RG-S6930-36DC40F1 switch

This 2U switch provides 36 200G front-panel ports, 40 200G fabric interconnect ports, 4 fans, and 2 power supplies.

NCF: RG-X56-96F1 switch

This 4U switch provides 96 200G fabric interconnect ports, 8 fans, and 4 power supplies.

Ruijie Networks will continue to develop and release products with 400G ports in the future.

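Conclusion

Ruijie Networks (stock code: 301165), as an industry leader, is committed to providing high-quality, highly reliable network equipment and solutions to meet customers' growing requirements for intelligent computing centers. Alongside the "Zhisu" DDC solution, Ruijie Networks is also actively exploring and developing end-network optimization solutions for traditional networking, which combine server SmartNICs with optimized network device protocols to raise network-wide bandwidth utilization and help customers move into the AIGC era of intelligent computing sooner.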

References:

[1] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. arXiv:2104.04473v5 [cs.CL], 23 Aug 2021.
