The score differences you observe are due to an inherent limitation of the FlexX scoring model rather than a bug. Score differences of +/- 2 units are quite common, and if a hydrogen bond is not found (e.g. due to a different orientation of a hydrogen in two different protein structures), the difference can easily rise to 4 units.
The FlexX scoring function, like other scoring functions used in docking, is in fact very sensitive to small geometric changes. On the other hand, if the scoring functions were not so sensitive, it would be rather hard to distinguish between different solutions. (This is actually one of the main problems in scoring in general.)
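
To make the sensitivity concrete, here is a toy sketch of a single hydrogen-bond term with a piecewise-linear geometry penalty. It is not the FlexX implementation: the function names, the ideal contribution of -4.0 units, and the tolerance values (0.2/0.6 Angstrom, 30/80 degrees) are assumptions chosen only for illustration.

```python
def ramp(deviation, full, zero):
    """Return 1.0 up to `full`, falling linearly to 0.0 at `zero`."""
    if deviation <= full:
        return 1.0
    if deviation >= zero:
        return 0.0
    return (zero - deviation) / (zero - full)


def toy_hbond_score(dist_dev, angle_dev, ideal=-4.0):
    """Toy H-bond term (hypothetical, not the FlexX formula).

    dist_dev  : deviation from the ideal H-bond length, in Angstrom
    angle_dev : deviation from the ideal H-bond angle, in degrees
    ideal     : assumed contribution of a perfect H-bond, in score units
    """
    return ideal * ramp(dist_dev, 0.2, 0.6) * ramp(angle_dev, 30.0, 80.0)


if __name__ == "__main__":
    # Same bond, with the hydrogen placed slightly differently each time.
    for dist_dev, angle_dev in [(0.1, 10.0), (0.4, 10.0), (0.1, 90.0)]:
        print(dist_dev, angle_dev, toy_hbond_score(dist_dev, angle_dev))
```

In this sketch a shift of a few tenths of an Angstrom halves the contribution, and a hydrogen pointing the wrong way removes it entirely, which is the kind of jump described above. Flattening the ramps would make the score more stable, but near-identical poses would then become harder to tell apart.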