The ubiquity of connected devices and parallel computing platforms challenges the efficient and reliable execution of machine learning algorithms. If machine learning workloads are executed only locally, a system does not always have sufficient resources at its disposal to perform the necessary operations fast enough. Furthermore, at a smaller scale, multiple hardware components are nowadays interconnected via on-chip or off-chip networks to create many-core systems. Communication, synchronization, and offloading have thus become essential in designing embedded systems under communication and resource constraints.

This chapter presents (1) the timing predictability of embedded systems and (2) the communication architecture in heterogeneous CPU/GPU environments. Synchronization with resource sharing, communication with potential failures, and probabilistic timing information are presented in Section 8.1. Bandwidth limitations of different execution models and coprocessor-accelerated optimization are presented in Section 8.2.