On a more complicated level, you may also be protecting yourself against both legitimate and illegitimate uses of your content. At the end of the day, a crawler can only discern what is integral to your content: how things appear visually on the page. It cannot tell that a piece of third-party content is being used under terms that permit only a limited purpose and forbid any broader one. This is where you may need to step in with more fine-tuned, detailed control over what can be garnered via crawling.
As evidence of this, many people have complained that their rankings dropped specifically with the Panda 4.0 release. Indeed, in that particular thread, the Google representative admitted that disallowing access to any URL which significantly affects visual display should be avoided. So, that’s the rather confusing short of it: Google maintains that disallowing crawls will not significantly impact the Panda algorithm’s ranking, while at the same time saying that disallowing them may make the site difficult to “read”. It could hardly be less clear. But Google always tends to be closed about the specifics of its algorithms, to help protect them from people trying to work the system.
Bearing that in mind, the safest option may be to simply allow the crawls rather than risk the negative impact. If you find that unacceptable, a middle ground might be to allow access to the resources that are central to visual formatting (typically CSS and JavaScript files), so Panda can reassemble the page properly when it retrieves it. But if you want the safest possible route, it is probably best to simply allow them if you can. And, of course, if it is a planned scan you are prepping your site for, you can always allow crawling just for the time it takes to perform the scan, then turn the disallow rules back on after it is complete.
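To make that middle ground concrete, here is a minimal robots.txt sketch of the idea; the /assets/ and /private/ paths are hypothetical placeholders and would need to match your own site’s structure:

    User-agent: Googlebot
    # Let Googlebot fetch the files it needs to render the page properly
    Allow: /assets/css/
    Allow: /assets/js/
    # Keep the rest of this hypothetical directory off limits
    Disallow: /private/

With rules along these lines, Googlebot can retrieve the stylesheets and scripts that control visual formatting while the rest of the restricted directory stays blocked. For a planned, temporary scan, you would remove or comment out the Disallow line for the duration of the scan and restore it once the scan is complete.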