# Fitting a grid to arbitrary points with reinforcement learning and Unity ML Agents


For Ludum Dare 48 I tried out Unity ML-Agents. The idea was to find a way to take arbitrary transform data and fit it to a grid, so that I could start reasoning about level data through grid block neighbors.

This was an experiment to determine the feasibility of the approach as a pre-processing step in a larger plan: applying wave-function-collapse-like logic to non-discrete levels.

The reward function was a normalized sum of two criteria:

- The number of transforms alone in a cube *vs* the total number of cubes
- The number of cubes with at least one transform inside *vs* the total number of cubes

These were offset so that a negative reward was produced in cases where these criteria were badly met.
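The project itself was in Unity, but the reward logic can be sketched in Python. This is a hypothetical reconstruction, not the actual code: the function name, parameters, and the 0.5 offsets are my assumptions about how the two normalized criteria might be combined and shifted negative.

```python
from collections import Counter

def grid_reward(positions, cell_size, offset, total_cubes):
    """Hypothetical sketch of the reward described above.

    Each criterion is normalized against the total cube count, then
    offset by 0.5 so a poor fit yields a negative reward (offsets are
    assumed, not from the original project).
    """
    # Snap each transform position to an integer cube index.
    cells = Counter(
        tuple(int((p - o) // cell_size) for p, o in zip(pos, offset))
        for pos in positions
    )
    alone = sum(1 for count in cells.values() if count == 1)  # transforms alone in a cube
    occupied = len(cells)                                     # cubes with >= 1 transform
    return (alone / total_cubes - 0.5) + (occupied / total_cubes - 0.5)
```

With two transforms in separate cubes out of four total cubes, both criteria sit at exactly 0.5 and the reward is zero; pack both transforms into one cube of eight and the reward goes negative.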

There were also a number of failure cases, triggered when the grid size got too small or too large, that would restart the agent.

### Some noteworthy learnings

Agents learn better when they perform step-wise actions, so let your agents move the dial of whatever space you are exploring slowly. This relates to how Q-learning works: it tries to predict the future reward of the current action. In my case this came down to shifting the grid scale and offset incrementally rather than using the absolute value returned by the agent.
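The incremental-action idea can be sketched as follows (names, step size, and bounds are all hypothetical; the real project did this in Unity C#): the agent outputs a discrete nudge direction, and we adjust the current value by a small step instead of jumping to an absolute value.

```python
STEP = 0.05                     # how far one action moves the dial (assumed)
SCALE_MIN, SCALE_MAX = 0.25, 4.0  # grid-scale bounds (assumed)

def apply_action(current_scale, action):
    """Apply a discrete action: 0 = decrease, 1 = hold, 2 = increase."""
    delta = (action - 1) * STEP  # maps 0/1/2 to -STEP/0/+STEP
    # Clamp so the agent can't push the grid into a failure state.
    return min(SCALE_MAX, max(SCALE_MIN, current_scale + delta))
```

Because each action only nudges the state, the predicted future reward of an action stays closely tied to the current state, which is what Q-learning-style value estimation relies on.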

The agent will find a local minimum if it is not reset. Make sure values are sufficiently randomized on restart to allow the agent to experience varied environments.
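A minimal sketch of that reset step, with assumed parameter names and ranges:

```python
import random

def reset_episode(rng=None):
    """Hypothetical episode reset: re-randomize the grid parameters so
    the agent starts each episode in a varied environment rather than
    settling into one basin."""
    rng = rng or random.Random()
    scale = rng.uniform(0.25, 4.0)                      # grid cell size
    offset = tuple(rng.uniform(-1.0, 1.0) for _ in range(3))  # grid origin
    return scale, offset
```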

I was trying to solve a problem that seemed to have no clear best solution: either there were many empty cubes and few transforms sharing a cube, or many transforms sharing a cube and few empty cubes. Because the reward function weighted both outcomes evenly, these two opposing criteria led to the agent constantly fighting between the two extremes. I think I would need to weight one over the other to get a more stable solution. The reality is that grids don't fit arbitrary data.

### Failings and potential improvements

There is still much to learn with regard to tweaking hyperparameters and network settings.

I suspect this type of approach would be a lot more successful with a tree-like data structure, where we could optimize the tree for certain criteria.

I experimented with a number of different reward function designs with varying degrees of success, but I think there are better options than the one I ended up using.

In the end, I think ML-Agents is better suited to problems with clear best-case solutions. The agent struggled to find a decent solution here. I would too.