About the SG2380(OASIS) category

SG2380

As the only RISC-V chip with integrated AI capabilities, the SG2380 is naturally competitive. Through the 8-core SiFive X280 and Sophgo’s TPU DSA design, which together achieve over 20 TOPS@INT8 of compute, the SG2380 brings intelligence to modern PCs, tablets, NAS devices, servers, and more. Together with the 16-core SiFive P670, the SG2380 opens up opportunities for applications in domains such as autonomous driving, cloud computing, edge computing, and wearable devices.

With its large memory capacity (supporting up to 64GB) and over 20 TOPS@INT8 of compute, the SG2380 can deploy larger-scale LLMs such as LLaMA-65B without the need for multiple external Nvidia or AMD accelerator cards.
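As a rough sanity check on the memory claim, the sketch below estimates the weight footprint of a 65-billion-parameter model at a few precisions and compares it with the 64GB maximum. The bytes-per-parameter figures follow from each data type, but the helper name, the use of decimal gigabytes, the round 65e9 parameter count, and the choice to ignore KV cache, activations, and OS overhead are all assumptions of this sketch, not figures from Sophgo.

```python
# Back-of-the-envelope weight footprint for a 65B-parameter model.
# Assumptions (not from the original post): decimal GB, weights only,
# no KV cache, no activations, no OS overhead.
PARAMS_65B = 65e9
MEMORY_GB = 64  # maximum memory configuration listed for the SG2380

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(num_params, bytes_per_param):
    """Return the size of the model weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for dtype, nbytes in BYTES_PER_PARAM.items():
        size = weight_footprint_gb(PARAMS_65B, nbytes)
        verdict = "fits" if size <= MEMORY_GB else "does not fit"
        print(f"{dtype}: {size:.1f} GB of weights, {verdict} in {MEMORY_GB} GB")
```

Under these assumptions the INT8 weights sit right at the 64GB limit, so a deployment of this size would most likely rely on quantization below 8 bits to leave headroom for everything else.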

First To Go

  • The first SoC based on the high-performance SiFive P670 and X280
  • The first affordable product combining RISC-V and AI, from DSA hardware architecture to toolchain and software ecosystem
  • The first RISC-V SoC aligned with Android compatibility and OpenHarmony
  • The preferred high-performance platform for RISC-V developers and the essential platform for RISC-V+AI developers

Spec

SG2380 Specification
Processor: 16-core SiFive P670, RVV 1.0, >2GHz
Co-Processor/TPU/NPU:
  • 8-core SiFive X280, RVV 1.0, >2GHz (Vector)
  • SOPHGO TPU (Matrix), 16 TOPS@INT8 and 8 TFLOPS@BF16 (FP16)
  • Register-level connection through VCIX with the X280s
GPU: IMG AXT-16-512
Memory: 16GB / 32GB / 64GB LPDDR5@6400Mbps, ECC, 128-bit
Bandwidth: >100GB/s
Storage:
  • UFS module support
  • 1x M.2 M Key 2280 with NVMe SSD support
  • 1x microSD card for recovery or OS loading
  • 4x SATA
Ethernet: 2x 1GbE RJ45, configurable with 2x 25Gbps through PCIe
USB:
  • 2x USB 3.0
  • 2x USB 2.0
  • 2x front-panel USB 2.0
  • 1x USB-C with DP Alt Mode
Display Engine:
  • 2x HDMI 2.0 out, 4K@60fps, CEC and eARC
  • 1x eDP
  • 1x TP (touch screen)
  • 2x MIPI-DSI, 2K@60fps
Wireless: 1x M.2 B Key 3050, 4G/WIFI/BT
TDP: 5W-30W
Price: 120-200 US dollars
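The ">100GB/s" bandwidth figure is consistent with the memory row of the spec: a 128-bit LPDDR5 interface at 6400 Mbps per pin. The sketch below works through that arithmetic. The function name is made up for illustration, and the result is a theoretical peak in decimal GB/s, not a measured number.

```python
def peak_bandwidth_gb_s(bus_width_bits, data_rate_mbps):
    """Theoretical peak DRAM bandwidth in decimal GB/s.

    bus_width_bits: total interface width in bits (128 in the spec)
    data_rate_mbps: per-pin data rate in Mbps (6400 in the spec)
    """
    bits_per_second = bus_width_bits * data_rate_mbps * 1e6
    return bits_per_second / 8 / 1e9  # bits -> bytes -> GB


if __name__ == "__main__":
    print(f"{peak_bandwidth_gb_s(128, 6400):.1f} GB/s")
```

Sustained bandwidth in practice is normally lower than this peak, since refresh, ECC traffic, and access patterns all take a share.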

Vendor

SiFive

:grinning:P670 Datasheet
:wink:X280 Datasheet

VCIX

SiFive Vector Coprocessor Interface (VCIX) Software Specification Version 1.0
LLVM vendor-extensions

IMG

:smiling_face_with_three_hearts:IMG AXT-16-512 GPU

Newsletter

  1. SG2380 will shine brilliantly on Second International workshop on RISC-V for HPC
  2. Sophgo SG2380 - A 2.5 GHz 16-core SiFive P670 RISC-V processor with a 20 TOPS AI accelerator - CNX Software

Attention

Availability and Pricing: The journey of the SG2380 SoC has officially begun. Expect the silicon to be at your fingertips in 9 months, and the Milk-V Oasis in just 10 months. An irresistible starting price of $120 awaits post coupon code application.

Secure Your Spot: Grab your coupon code and reserve the future of RISC-V desktops with our distributor, Arace Tech: Milk-V Oasis Pre-order.

There will be four discount tiers: Coupon (20% discount), Super Early Bird (15%), Early Bird (10%), and Kickstarter/Crowd Supply Special (5%).

We welcome any suggestions regarding the SG2380. Valuable suggestions that we adopt will be particularly beneficial for developers and a significant contribution to the open-source community! We will also offer early access privileges and small gifts.

Hi, the SoC sounds great. Will there be any PCIe lanes available, and what’s the rough delivery timeline? I.e., what stage is the SoC at: design, IP, RTL, fab?

PCIe Gen4 x16/x8+x4+x4
PCIe Gen4 x1+x1+x1+x1
Milestones (we hope):
RTL freeze: March 2024
Fab tape-out: 2024-04-15 to 2024-05-30
We hope to start the super early bird pre-sale on the Kickstarter and Crowd Supply platforms in March next year.

Ahh, interesting, thanks.
I’m aware that the X280 received an enhancement that formalized the VCIX and increased the number of cores and clusters. The spec above mentions 8 X280 cores; I’m assuming that’s 2 clusters of 4 cores.
Can I ask if you folks are using the ‘enhanced X280 spec’ and talking to the TPU through the VCIX, and what the ratio of cores:cluster:TPU is? I.e., does each cluster have its own TPU, or is there a single common TPU out on the system fabric?
Separately, given that the X280 and P670 have differing VLENs, they cannot be managed in a big.LITTLE fashion as Arm does, i.e. no live migration between the different cores, only within similar CPU clusters. So I’m assuming that you will have different OS instances running on each set of cores, i.e. one for the P670s and one for the X280s? Will there be support for communication between the cores, i.e. mailboxes or something else?
Thanks for the awesome SoCs, by the way!
TJ

Yes. 2 clusters and 4 cores in each cluster.

In this configuration, we estimate the ratio of computing power to be about 1:10, roughly an order of magnitude. Each X280 is paired with 2 TOPS@INT8 / 1 TFLOPS@BF16. We had estimated that, based on the ALU alone, the computing power would be only slightly over 1 TOPS@INT8; however:

The SiFive Intelligence Extensions enable the X280 processor to achieve a best-in-class 2.25 TFLOPs (bfloat16 MatMul) or 4.6 TOPS (INT8 MatMul), supporting the broadest range of ML workloads and AI computation needs. The SiFive Intelligence acceleration gives a 5x improvement over that achievable using the out of the box RISC-V Vector ISA.

We are also concerned that the computational power of the X280 might not be sufficient. Some even suggest that the TPU should include an additional vector unit (currently, we only plan to use the TPU for matrix operations).

Secondly, the X280s get their own TPU rather than a single common TPU out on the system fabric. Through VCIX, we can achieve register-level coupling and latency. This differs from the out-of-system TPUs our company built before (at Bitmain and Sophgo). Otherwise, we wouldn’t need the X280; we could just use the P670 with an out-of-system TPU.

We have 2 premises:

  1. We need to implement a RISC-V + TPU extension, which is essentially a customized instruction set. We can define certain TPU instructions through the VCIX protocol standard. Additionally, VCIX has already been upstreamed to LLVM version 17.0.1. (What's new for RISC-V in LLVM 17 - Muxup)

  2. We need to be compatible with the RISC-V software ecosystem (which is why we’re extending the ISA). If we used an out-of-system TPU, we wouldn’t be able to leverage RISC-V compilers and the software ecosystem. By following the approach in point 1, we can be compatible with multiple projects, for example IREE (OpenXLA).

So for the compiler we only need to configure the LLVM backend, and users are free to choose the IR and front end, which remain compatible with MLIR (very friendly).

Yes, you are right, and thanks for the correction: the widths differ (taking the ALU as an example, 128 bits for the P670 versus 256 bits for the X280). We decided to use the Arteris NoC, which can enable or disable cache coherence.

  • With coherence on, we need a scheduler to dispatch the workload.
  • With coherence off, there are two options:
  1. One OS for the P670s and another for the X280s, communicating through a mailbox
  2. One host OS on the P670s issuing commands, with no OS on the X280s (so no malloc or similar, only a cmdbuf, which is much more difficult)

As you know, it is a trade-off. Thank you again for your advice! We really look forward to hearing more! :smiling_face_with_three_hearts:

I am not purely a tech guy but a RISC-V ecosystem volunteer with a software background. If you have any questions, I will do my best to answer them and will ask my colleagues for help. Feel free to ask me anything.

Update on 2023-10-21
[Images: Memory, Display, High-Speed Interface]

How are the X280s configured?
The default seems to be VLEN of 512 and DLEN of 256, you mentioned that they have a VLEN of 256.
So is the DLEN 128 or 256?

Also, will we get user manuals for both processors in the future? AFAIK you currently need to buy a processor design from SiFive to get access to the corresponding manuals.

Thanks, I’m really excited for this SOC.

Thanks for your comments.
For the X280, DLEN is 256 bits and VLEN is 512 bits. Sorry for the confusion: the figures I mentioned earlier, 256 bits for the X280 and 128 bits for the P670, were ALU widths.

As for the configuration, we use the Arteris NoC, which can enable or disable cache coherence. All we need is a capable scheduler to dispatch the workload.

IF COHERENCE ON
Since the DLEN/VLEN differs between the 280s and 670s, a strong scheduler is definitely needed to dispatch the workload. Fortunately, at the first stage, using the 280s + TPU for AI workloads is fairly easy to solve. We will also continue optimizing.
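To illustrate why differing VLENs complicate scheduling, the sketch below models the RVV rule VLMAX = LMUL × VLEN / SEW and a simplified strip-mining loop. The X280 value (VLEN = 512) comes from this thread; the P670 value of 128 is an assumption used only for contrast, since the thread gives only its ALU width. The `min(remaining, VLMAX)` rule is a simplified model of what `vsetvli` grants, and the function names are made up.

```python
def vlmax(vlen_bits, sew_bits, lmul=1):
    """Maximum elements per vector instruction: VLMAX = LMUL * VLEN / SEW."""
    return vlen_bits * lmul // sew_bits


def strip_mine(n_elements, vlen_bits, sew_bits=32, lmul=1):
    """Return the vector length used on each loop iteration.

    Simplified model: each iteration processes min(remaining, VLMAX) elements.
    """
    limit = vlmax(vlen_bits, sew_bits, lmul)
    chunks = []
    remaining = n_elements
    while remaining > 0:
        vl = min(remaining, limit)
        chunks.append(vl)
        remaining -= vl
    return chunks


X280_VLEN = 512  # from the thread
P670_VLEN = 128  # assumption for illustration only

if __name__ == "__main__":
    print("X280:", strip_mine(40, X280_VLEN))
    print("P670:", strip_mine(40, P670_VLEN))
```

The same 40-element loop takes a different number of iterations with a different `vl` on each core type, so a thread moved mid-loop would be holding vector state sized for the wrong machine. That is the kind of case a hybrid scheduler would have to handle.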

IF COHERENCE OFF
We had better deploy two different OSes, one on the 670s and one on the 280s. In this way, a mailbox or another communication mechanism is supported. That will be very interesting.
Alternatively, one OS using UVM with a command buffer, which is more efficient.
On this point, I have seen an interesting question on Reddit: whether it will be possible to execute bare-metal code on the X280 cores against P670 memory (kind of like what Apple Silicon does with the GPU and its unified memory system) :smiley:

Yes, definitely. SiFive and Sophgo now have a deep collaboration. SiFive will announce it at RISC-V Summit North America 2023.

Also, this is explained clearly at https://www.reddit.com/r/hardware/comments/17duq2f/introducing_the_milkv_oasis_with_sg2380_a/?onetap_auto=true

Thanks again for your advice.

To all,

There are two different paths:

  • Hybrid Scheduling: Managing both the 670 and 280, handling the differences in DLEN/VLEN and thread migration between them. This approach requires a powerful scheduler to determine runtime/func call dynamic thread switching. This needs to be implemented in the Linux kernel. This is for (670 + 280) on the same OS.

  • Explicit Invocation: This is the form preferred by the community. It uses UVM + coherence, with the application explicitly calling the (280 + TPU/NPU). The OS runs on the 670, and UVM + coherence handles the data buffers, allowing the 280 to run without an OS (to maximize performance). This approach takes inspiration from software models like CUDA and Apple’s M1/M2. It requires well-written drivers and is complex, but it offers lower power consumption.

Our Chief Engineer: The second approach is more beneficial. Running an operating system on the X280 doesn’t serve a meaningful purpose. Establishing shared virtual memory between the P670 (host CPU) and X280 (accelerator) is highly valuable, especially when working with large models that require significant data transfer between the host CPU and accelerator. By enabling the host and accelerator to share the same memory space, energy efficiency can be significantly improved. This is also an optimization seen in architectures such as Grace Hopper, M1/M2. Moreover, CUDA, OpenCL, HIP, and similar frameworks support Shared Virtual Memory (SVM).
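The benefit of a shared address space can be shown with a toy comparison: in one path the "accelerator" works in place on the host's buffer, and in the other the data is copied to a separate buffer and back. This is only a conceptual sketch in plain Python. The function names are invented, `array` and `memoryview` stand in for real host and device memory, and nothing here reflects the actual SG2380 driver or UVM implementation.

```python
from array import array


def accelerator_scale(view, factor):
    """Stand-in for an accelerator kernel: scale every element in place."""
    for i in range(len(view)):
        view[i] = view[i] * factor


def run_shared(values, factor):
    """Shared-memory path: the accelerator sees the host buffer directly."""
    buf = array("f", values)      # host allocation
    shared = memoryview(buf)      # same memory, no copy
    accelerator_scale(shared, factor)
    shared.release()
    return list(buf), 0           # result, bytes copied


def run_copy(values, factor):
    """Discrete-memory path: copy to the device, compute, copy back."""
    host = array("f", values)
    device = array("f", host)     # host -> device copy
    accelerator_scale(memoryview(device), factor)
    host = array("f", device)     # device -> host copy
    return list(host), 2 * len(host) * host.itemsize


if __name__ == "__main__":
    print(run_shared([1.0, 2.0, 3.0], 2.0))
    print(run_copy([1.0, 2.0, 3.0], 2.0))
```

Both paths give the same result, but the copy path moves every byte twice. With model weights in the tens of gigabytes, that traffic is what a shared virtual memory design is meant to avoid.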

Also, this structure works well for PyTorch and TensorFlow.

So, it is clear :grinning:
I apologize for my language mistakes.

OpenCL, CUDA, and HIP programming languages all share the same Shared Virtual Memory (SVM) model. Therefore, following this architectural design, compatibility with these programming languages can be achieved. This efficient compatibility enables the utilization of AI frameworks built on top of these languages to make full use of SVM. This approach leverages the high bandwidth of Wafer on Wafer memory and avoids the need for memory copies between the host and accelerator.

I forgot to mention: we might use Wafer-on-Wafer (WoW) packaging in some batches.

The X280 controls all Tensor Cores and SFUs, while the P670 sends the entire kernel’s instructions to the X280 and places the corresponding data in SVM. This lets the X280 decide independently which instructions go to RVV and which go to the Tensor Cores and SFUs. After completing its tasks, the X280 can signal the host through the events of a heterogeneous programming language (such as OpenCL) to indicate that the current kernel has finished executing.
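The flow described above (host submits a whole kernel, the X280 routes each operation to a unit, then signals completion) can be sketched as follows. The operation names and the routing table are purely hypothetical, since the thread does not say which operations go to which unit. A `threading.Event` stands in for an OpenCL-style completion event.

```python
import threading

# Hypothetical routing table: which unit handles which kind of operation.
# The real mapping is not specified in this thread.
ROUTES = {"matmul": "tensor_core", "exp": "sfu", "add": "rvv"}


def dispatch_kernel(kernel_ops, done_event):
    """Model of the X280 side: route each op to a unit, then signal completion."""
    routed = {"tensor_core": [], "sfu": [], "rvv": []}
    for op in kernel_ops:
        unit = ROUTES.get(op, "rvv")  # unknown ops fall back to the vector unit
        routed[unit].append(op)
    done_event.set()  # stand-in for an OpenCL-style "kernel finished" event
    return routed


if __name__ == "__main__":
    done = threading.Event()
    # Stand-in for the P670 side: submit the whole kernel in one go.
    result = dispatch_kernel(["matmul", "add", "exp", "matmul"], done)
    print(result, "finished:", done.is_set())
```

The point of this design is that the host only sees one submission and one completion event per kernel. The fine-grained routing stays on the accelerator side.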

The P670’s role is to ensure that software running on the host perceives the memory within the WoW and external DRAM as a continuous address space. The operating system should minimize its use of the WoW memory, reserving it primarily for applications on the AI side (OpenCL/CUDA/HIP) to utilize this portion of memory effectively.

So, changes will be needed in the operating system, and some adjustments might be required in the MMU as well. The objective is to enable user-space applications to have control over the MMU’s mappings and see virtual memory that can be effectively mapped to the portion of SVM within WoW. This particular aspect should be manageable by our colleagues on the OS team.

Thanks for the information @Sandor, everything is much more clear now. I have two questions however.

Did you run any internal benchmarks of how an OS runs on the 16 P670 cores versus all 24 P670 + X280 cores? I am mostly interested in Rust compilation times, which are very parallelizable, but any kind of information on that would be very interesting.

I am not sure, but as far as I know CUDA is a proprietary language of NVIDIA for use with their GPUs, so I believe it will not be possible to use it with the SG2380 NPU. I’m assuming that OpenCL and OpenXLA (so PyTorch, TensorFlow, and JAX) will be able to target those cores, is that right? What about other languages like HIP and perhaps RISC-V assembly?

Thanks for your interest!
At present, there is no more relevant information, especially for Rust compilation times. I will post the relevant data on the forum when it is available.

Yes, CUDA is not an open-source project. The reference to CUDA here is just an example, and it might not really be applicable to our NPU.
And you’re right, OpenXLA is definitely supported, since Google and Meta used the X280 + MXU (NPU or similar) in the same way.
Google TPU and SiFive Intelligence X280
HIP and RISC-V assembly are supported too.

Sure, I don’t expect any specific benchmarks this early on, but I will keep my eye on any updates. Thanks for the update and very exciting SoC!

Hi Sandor thanks for the info, it does make things clearer.

  1. So from what you’ve posted here, essentially it (the X280 cluster + TPU) will be treated as an OpenCL or OpenXLA accelerator and programmed that way. I’m assuming that this will require a small kernel or “BIOS” to be running on the X280 cluster in order to initialize it for communications; either that, or you’re going to need to provide very detailed specifications for X280 & TPU power-up and programming. I would request and suggest you make that “NPU” kernel/BIOS code available separately and define the communications model cleanly. Will communications between the P670 and X280s be ‘mailbox’ based then?

  2. I expect you’ve heard about the apparent layoffs/changes at SiFive, and given the stage of the SG2380 development it would be nice to hear an “official” word from Sophgo as to whether or not that’s going to impact the SG2380 and other product development. I’m hoping for your sakes that you’ve already completed the IP etc. handoffs well before all this and are independent of anything aside from possible support issues at this stage.

On the first question: I will let you know as soon as all the details of the technical route have been decided.

AFAIK, SiFive’s layoffs only affected the Wake Build Team and some non-core personnel. However, due to business ethics, I am unable to disclose more information. This will not affect the SG2380. All tasks, such as the IP handoffs you mentioned, were completed earlier this month, and the research and development of the Sophgo SG2380 will not be affected.

Thanks for your concern again!

Flexible PCIe interface design:
PCIe x16/x8+x4+x4
PCIe x1+x1+x1+x1


newly updated

Is ethernet on motherboard 1GbE or 2.5GbE?

2x 1Gbps is standard; 2x 25Gbps can be configured through PCIe (for a DPU).