The "wedge" part under "3. Mode Connectivity" has at least one obvious component: Neural networks tend to be invariant to permuting nodes (together with their connections) within a layer. Simply put, it doesn't matter in what order you number the K nodes of e.g. a fully connected layer, but that alone already means there are K! different solutions with exactly the same behavior. Equivalently, the loss landscape is symmetric to certain permutations of its dimensions.
This means that, at the very least, there are many global optima (well, unless all permutable weights end up with the same value, which is obviously not the case). The fact that different initializations/early training steps can end up in different but equivalent optima follows directly from this symmetry. But whether all their basins are connected, or whether there are just multiple equivalent basins, is much less clear. The "non-linear" connection stuff does seem to imply that they are all in some (high-dimensional, non-linear) valley.
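One way to see why the connecting paths have to bend: take a network, build its permuted twin (both endpoints compute exactly the same function, so both sit at the same loss), and walk the straight line between them in weight space. A rough sketch, reusing the same toy MLP setup as above, measures how far the interpolated network drifts from the endpoint function on random inputs; it is typically nonzero in the middle, which is the usual barrier along the linear path.

    import numpy as np

    # Linearly interpolate between a network and its permuted copy and measure
    # how much the interpolated network's outputs deviate from the (shared)
    # endpoint function on some random inputs.
    rng = np.random.default_rng(1)
    d_in, d_hidden, d_out = 4, 8, 3
    W1 = rng.normal(size=(d_hidden, d_in)); b1 = rng.normal(size=d_hidden)
    W2 = rng.normal(size=(d_out, d_hidden)); b2 = rng.normal(size=d_out)
    perm = rng.permutation(d_hidden)
    W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

    def mlp(X, W1, b1, W2, b2):
        return np.maximum(0.0, X @ W1.T + b1) @ W2.T + b2

    X = rng.normal(size=(256, d_in))      # stand-in for the input distribution
    target = mlp(X, W1, b1, W2, b2)       # both endpoints produce exactly this

    for t in np.linspace(0.0, 1.0, 5):
        out = mlp(X, (1 - t) * W1 + t * W1p, (1 - t) * b1 + t * b1p,
                  (1 - t) * W2 + t * W2p, b2)
        print(f"t={t:.2f}  mean squared deviation: {np.mean((out - target) ** 2):.4f}")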
To be clear, this is just me looking at these results from the "permutation" perspective above, because it leads to a few obvious conclusions. But I am not qualified to judge which of these results are more or less profound.
The "functional equivalence" discussed in the past gets at some of this. There's definitely more going on than permutation symmetry; in particular, if two solutions are symmetrically related (and thus evaluating the same overall function) then ensembling them together shouldn't help. But ensembling /does/ still help in many cases.
The different solutions found in different runs likely share a lot of information, but learn some different things on the edges. It would be cool to isolate the difference between two networks...
Completely agree! Plus, less trivially, there can be a bunch of different link-weight settings (for an assumed distribution of inputs) that result in nearly identical behavior, and all of those are then multiplied by the permutation count you just mentioned! So, it's complicated...