Many non-volatile memories (NVMs) suffer from severely reduced cell endurance and therefore require wear-leveling. The heap, as one memory segment that may be mapped to an NVM, exhibits strongly application-dependent behavior in the number of memory accesses and allocations. A simple deterministic wear-leveling strategy for the heap may perform poorly when the available action space becomes too large. In this paper, we therefore investigate a reinforcement learning agent as a substitute for such a strategy. The agent’s objective is to learn a strategy that is optimal with respect to total memory wear-out. We conclude with an evaluation that compares the deterministic strategy with the proposed agent. The proposed agent outperforms the simple deterministic strategy in several cases; however, we also identify further optimization potential in the agent’s design and deployment.