Sunday, November 18, 2012

Pelican Imaging Capabilities Presented

Pelican Imaging has started to present its approach at different forums. Apparently, Mobiletrax's post is based on one of those presentations:

"The founders of Pelican Imaging started with the idea of using an array of low resolution photoplanes coupled to an array of small lenses and using the overlapping information to create astounding images and videos.

The core intellectual property (IP) starts by using an array of 16 inexpensive, mass produced accurately aligned cameras. The array creates 16 images – each one slightly different from the other since it is capturing the image from a slightly different angle. Because the input images all come from lenses and pixel technologies that are mature and low cost, the yield of the solution should be very good. You can see the 16 ‘similar’ images captured by using the 16 lenses in Figure 2."


"Because each image is taken from a slightly different angle, the processing logic can determine the distance objects are from the camera, thus providing a depth map of the scene... And, the 16 slightly different versions of the photo offer the ability for Pelican’s proprietary (and patent pending) software to synthesize a higher resolution image."

Some of the camera capabilities are shown in pictures:

29 comments:

  1. Interesting for sure. Wonder what the drawbacks are besides processing power and more pixels? It would always be good to see some real side-by-side image comparisons. The ones shown seem to favor Pelican a little too much.

    Replies
    1. 1. Lytro needs a ~10Mpx sensor to generate a 1Mpx image, and I would expect a similar ratio here. Example: we are taking a picture of a fence and a highly textured object behind it.
      Even for a flat scene, we have 16 coarsely sampled ("huge"-pixel) images and want to recover high-frequency information, i.e. super-resolution (with its limitations), plus non-uniform sampling.

      2. They seem to use different exposures (4 cameras shorter, 4 longer, 8 in the middle). What's the purpose?

    2. 2. Maybe they are not using different exposures; instead, 8 sensors record green, 4 red, and 4 blue, so they'd neatly get away from using a color filter array?

    3. Nice interpretation

    4. Great idea, you are right, I was able to combine them. But then we can expect color artifacts on depth edges.

  2. That's why they don't show full images!

  3. How can they measure a 13-foot distance with a baseline of a few millimeters or probably less?

  4. The baseline between the cameras at the edges of the array is close to 1cm, which is enough.

    Replies
    1. 10 mm sensor width. That's a big sensor. What resolution are they targeting? Doesn't sound mobile.

    2. I don't think the die needs to be so big, but it is definitely much bigger than single-lens imaging for the same resolution. But probably some module costs are reduced, such as AF. It is an interesting trade space. Still, there needs to be a compelling reason for adoption (and displacement of the incumbent single-lens camera), not just a flat trade space. What is that reason? -- although thinner module height has to be a pretty good one.

      I also wonder, for a "synthesized" higher resolution image, whether it is truly higher resolution or just similar to an interpolated higher resolution image from a single-lens camera.

  5. What is the error at 13 ft away? And at 7 ft away?

  6. It seems to be a nice idea to overcome mechanical issues (lens, shutter) by exploiting the progress of sensor technology.

  7. Is the resolution of this camera they show 16MP as well?

  8. Who makes this sort of a sensor? It would seem that they need several mini pixel arrays.

  9. Isn't the core IP owned by Micron?

  10. In Figure 2, why is there no disparity between the individual sensors?

    Whoever came up with the fake image doesn't even know enough about image formation and multi-view photography to make it a little more realistic. I hope that didn't come from Pelican. If it did, I would have a couple of concerns:

    1. Why create fake data when they have the hardware to take real data?
    2. Shouldn't they know enough to generate more realistic fake data?

    Something smells fishy to me.

    Replies
    1. Why do you expect to see a few pixels of disparity?

    2. (1) http://media.tkk.fi/visualmedia/reports/EI-papers-2011/Kyto_MethodForMeasuringStereoCameraDepthAccuracy.pdf

      pix-disparity * pix-size = focal-length * baseline / depth

      Assuming 10mm total width, then baseline between the farthest sensors is ~7.5mm. It's a little larger on the diagonal sensors if projected into the epipolar line. Conservatively, assume pix-size is 3um (on the large end since consumer sensors typically range from 1.4-2.2um).

      (*)
      pix-disparity = focal-length * 7.5mm / (depth * 3um)
      = (focal-length / depth) * 2500

      Assuming the model's head is 100mm wide (on the small end) and each sensor is 2.5mm wide (4 sensors in 10mm with no spacing), then

      (**)
      100 mm / depth = k * 2.5mm / focal-length

      where k is the fraction of the sensor width her face covers. In this case, it's about 1/3, judging from the fact that her face takes up 1/3 of the image. This comes out to:

      (***)
      focal-length / depth = 0.0083333

      Plugging (***) into (*) gives:

      pix-disparity = 0.008333 * 2500 = 20.8333 pixels

      All the numbers have a little wiggle room since I don't have their specs, but you'd have to wiggle a lot to make a 20-pixel disparity go away. The guesses have been made conservatively, so chances are that the actual disparity would be a little larger. For example, chances are that their pixel size is < 3um, which would proportionately increase the disparity in pixels.

      Unless I am doing something wrong, I would expect a vertical shift between the top and bottom sensors of the same column. Likewise, a horizontal shift between the left and right sensors of the same row.
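      For concreteness, here is the same arithmetic as a small Python sketch (the 10 mm array width, 3 um pixels, 2.5 mm sub-sensors, and 1/3 face coverage are the assumptions made above, not published specs):

# Back-of-the-envelope disparity between the outermost cameras of a 4x4 array,
# using the assumed (not published) numbers from the discussion above.
baseline   = 7.5e-3   # m, center-to-center distance of the farthest sensors
pixel_size = 3.0e-6   # m, deliberately on the large side
sensor_w   = 2.5e-3   # m, width of one sub-sensor
head_width = 0.100    # m, assumed width of the model's head
coverage   = 1.0 / 3  # fraction of the sensor width the head occupies

# From (**)/(***): coverage * sensor_w / head_width = focal-length / depth
f_over_z = coverage * sensor_w / head_width           # ~0.00833

# From (*): pix-disparity = focal-length * baseline / (depth * pixel-size)
disparity_px = f_over_z * baseline / pixel_size       # ~20.8 pixels

print(f"focal-length / depth = {f_over_z:.5f}")
print(f"expected disparity   = {disparity_px:.1f} px between the farthest sensors")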

      Now suppose I were wrong and the disparity is subpixel at 0.1-1m (roughly where the model is sitting); then there is a bigger problem. I don't see how the depth resolution at 13 ft away is worth anything (Figure 4), since (1) tells us that depth error varies with the square of depth.

      Even if the underlying image contains no frequency content above the Nyquist cutoff, subpixel matching can cause other issues:

      (2) http://perso.telecom-paristech.fr/~delon/Pdf/2007_Small_Baseline_Stereovision_JMIV.pdf
      (3) http://www.cse.cuhk.edu.hk/~leojia/papers/upsampling_cviu2012.pdf

      If the underlying image does contain frequencies above the Nyquist cutoff, then a de-aliasing procedure is required to remove aliasing artifacts.

      On top of that, we know that subpixel matching for multi-view is computationally more expensive (from 2 and 3) even without de-aliasing.

      Sorry to be overly expressive, but this seems like the typical marketing gimmickry that plagues the technology industry.

    3. Thanks. Wow. I agree. These images are most probably FAKE. Sad. I think they should just wait until they have real images to show before going public.
      After the Nokia tricks with the cameraman and all the fake images they published, others are following.

    4. From the descriptions, I understood that Pelican's technology was related to the Stanford multi-cam array, see here: http://graphics.stanford.edu/projects/array/
      It's a really different depth estimation technology from standard stereo vision, hence results such as Delon's and Sabater's do not apply (the small-baseline reference is work that targeted airborne imaging contexts).

      The main question is: how does this technology scale down to 16 cameras instead of 256? The depth estimation algorithm from Stanford uses color histograms, hence it requires enough measurements to be reliable.

      For the image super-resolution, state-of-the-art algorithms can easily produce a 3x magnification (on each side) from 16 images. The main issue is registering the images together, but it is easier to solve with the Stanford multi-cam array calibration framework than it is "in the wild".

      So it's probably not fake, but as usual regular results may not be as nice as the ads.

    5. Regardless of whether you have 2 or more than 2 sensors, the geometry and image processing are similar. Extensions have been made to multi-view. See, for example, http://www.epixea.com/research/multi-view-coding-thesisse15.html

      Delon's work is one of the few that demonstrate (mathematically as well as experimentally) good subpixel accurate correspondence. It was brought up in this context.

      The dispute was not about whether it is possible to build an array camera. Others have done it before: TOMBO (http://ir.library.osaka-u.ac.jp/dspace/bitstream/11094/3086/1/ao40_11_1806.pdf), PERIODIC (http://www.cs.wfu.edu/~pauca/publications/ACMSE2007periodic.pdf), and many others with as few as two (stereo), 2x2, or 3x3 cameras.

      The dispute is why they published fake data and why the fake data wasn't faked appropriately. Figure 2 is most likely fake because the images from the individual sensors are exactly the same. One would expect to see disparity between the images as worked out above.

      The fact that it is 16 instead of 256 shouldn't be the issue. If you are concerned with getting accurate depth, there are tons of reasonable algorithms that can get dense depth from stereo: http://vision.middlebury.edu/stereo/
      http://vision.middlebury.edu/stereo/eval/

      I don't know where you are getting 3x super-resolution magnification from, but the math clearly prevents this. We know that images recorded by the sensor get filtered in two ways: the first is optical and the second is sensor spatial integration. Supposing that their optics are perfect (thereby introducing aliasing, which is another problem), the sensor spatial integration would still prevent them from getting a perfect 3x. Sensors integrate over the size of the pixel. This rectangular averaging effectively removes (or attenuates) frequencies higher than Nyquist. The higher frequency information is degraded or removed altogether. How can you claim to do 3x magnification when the information just isn't entirely there?

      To get an intuition for this, consider, for example, a vertically oriented sine wave with a period that fits completely into a pixel of the individual sensors. This sine wave would look like uniform gray in all the sensors. No amount of computation can tease this apart from an actual uniform gray image. I realize this example is the worst-case scenario, because a rect filter only notches out multiples of the sampling frequency; other high frequencies only get severely attenuated (the Fourier transform of a rect is a sinc, and a sinc has side lobes).
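      As a minimal sketch of that worst case (with made-up numbers, not Pelican's specs): a sine grating whose period equals the pixel pitch integrates to the same constant in every pixel, so every sub-camera records the same flat gray no matter how the grating is shifted.

import numpy as np

# Worst case for super-resolution: a sine grating whose period equals the
# pixel pitch averages to the same constant inside every pixel, for any shift.
pixel_pitch = 1.0    # arbitrary units; the grating period is set equal to it
oversample  = 1000   # fine samples per pixel to emulate the continuous scene
n_pixels    = 8

x = np.arange(n_pixels * oversample) / oversample        # continuous coordinate
for shift in (0.0, 0.25, 0.6):                           # different sub-pixel shifts
    scene  = 0.5 + 0.5 * np.sin(2 * np.pi * (x + shift) / pixel_pitch)
    pixels = scene.reshape(n_pixels, oversample).mean(axis=1)   # box integration
    print(f"shift {shift:4.2f}:", np.round(pixels, 4))   # every pixel reads ~0.5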

      Now, this isn't to say that you can't recover anything. You can still do some form of super-resolution, since some of the higher frequency information is still there and you can attempt to fill in the missing frequencies. Other people have done this. See, for example:

      http://hci.iwr.uni-heidelberg.de/Staff/bgoldlue/papers/GC09_STD.pdf

      However, whether you use a statistical Bayesian-type method like MAP or ML, or a global calculus-of-variations method, the computational cost is very high. Add to this the fact that (at some frequencies) you are actually "filling in" information that just isn't there anymore -- i.e. the problem is inherently ill-posed and the solution space is much bigger than the problem space. I guess if they manage to do something, it would be like Lytro: take a picture, upload it to your computer, and wait a long time to get the result.

      That aside, you can get some form of a synthesized higher resolution image. There is still a limit to that:

      http://research.microsoft.com/pubs/69073/2004-TPAMI-SR.pdf

      3x is probably only achievable in a perfect setting.

    6. There is no apparent parallax in Figure 2 because the subject is close to the camera and there is very little scene depth available for comparing the sub-image parallax.

      I am sure they cannot get away with faking images at this point, as many OEMs are following their progress, and I would expect the OEMs have the camera in their labs.

    7. You mean the subject is far from the camera and there is little scene depth -- in that case, your point makes sense.

    8. The problem is that the subject fills up 1/3 of the image. This means that the subject can't be too far from the camera unless the focal length is long. If the focal length is long, then a subject that is far away would still create disparity between the sensors.

      The mathematical details of this were worked out above. For an object that is about 100mm (80mm face + 20mm hair) wide to be projected onto a 4x4 array of sensors with a footprint of 10mm x 10mm, there must be a disparity of about 20 pixels between the far sensors.

      If they have cameras in the lab, why not use it to take a picture for marketing purposes? And therein lies the "fishy-ness".


    9. >If they have cameras in the lab, why not use it to take a picture for marketing purposes? And therein lies the "fishy-ness".

      Perhaps there are no pretty models available in the Pelican lab? I would not get bent out of shape over a marketing slide. Looks to me like it was designed to illustrate the concept, not show experimental results, although the caption is slightly misleading. But even the caption may not have been written exactly by Pelican. Let's just wait until we see some real results.

    10. I absolutely understand your point. The gripe was exactly about the fact that it's misleading (1). Maybe it's unintentional, but in my mind there is a huge difference between "a depiction of how the camera works" and "captured using their array". It also seems to me that they should want to make it a little more realistic (2), but I guess that depends on whether that image was made by marketing or engineering.

      I wonder when they will present real results from the camera, and, on a side note, if they don't have pretty models in the lab, how on earth do they expect to recruit good engineers?

  11. Honestly, having spent a year playing with both the Stanford data and a university-made camera array, I can tell you that tackling a stereo approach versus an explicit "many views in a planar configuration" approach is really different and leads to different algorithms, both for the calibration and for the depth estimation. (You can actually consider classical stereo rectification to be a limit case of camera array calibration; I found one stereo paper that could be exploited to make this link.)

    Having 16 or 256 cameras really does matter if you leverage the same algorithms as Stanford's multi-cam array (and I insist, these are not the usual shape-from-motion ones; integrating more information allows for simpler approaches), because the best approach that they tested relies on the color histogram entropy among the input cameras. You can accurately estimate this entropy if:
    1. you have enough measurements, i.e. enough cameras;
    2. you have good quality cameras (in particular, you need accurate color calibration), which can be really difficult to achieve with the very small sensors they seem to have in the product.
    These two points can actually lead to failures in the product, and together they can explain the fake images and future errors in the depth estimation: if you can't measure an accurate color, you flatten your color histogram, and by removing cameras (256 -> 16) you lessen the chance of observing a good peak in the histogram even at the correct depth (see the toy sketch below).
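    A toy sketch of the entropy cost (hypothetical numbers, not Pelican's or Stanford's actual code): a candidate depth is scored by the entropy of the color samples the cameras contribute for the same scene point; the correct depth gives consistent samples and low entropy, and the contrast between correct and wrong depths is easier to measure with 256 samples than with 16.

import numpy as np

def color_entropy(samples, bins=16):
    """Entropy of the histogram of samples (one per camera) gathered for a
    candidate depth. Low entropy means the cameras agree on the color."""
    hist, _ = np.histogram(samples, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
true_color = 0.3
for n_cams in (256, 16):
    # Correct depth: every camera sees roughly the same color (plus sensor noise).
    agree = true_color + 0.02 * rng.standard_normal(n_cams)
    # Wrong depth: the cameras sample unrelated scene points.
    disagree = rng.uniform(0.0, 1.0, n_cams)
    print(f"{n_cams:3d} cameras: H(correct depth) = {color_entropy(agree):.2f}, "
          f"H(wrong depth) = {color_entropy(disagree):.2f}")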

    Replies
    1. I think your experience with data from the Stanford camera would probably then tell you that multi-view depth computation (which includes stereo) is inherently a search. There are generally two (and a half) kinds of variability between different kinds of algorithms. (a) The first is the search method, which affects primarily run time. (b) The second is the search metric (cost function), which affects run time and quality. The remaining half is due to the fact that multi-view correspondence is an ill-posed problem: (c) some people impose additional artificial constraints (called regularization) to improve the solution. An example of a common artificial constraint is the requirement that the depth map be piece-wise smooth.
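      As a toy illustration of (a)-(c) only (a sketch on synthetic data, not anything Pelican necessarily does), here is a brute-force disparity search on a 1D scanline pair with a sum-of-absolute-differences cost and a crude smoothness penalty as the regularizer:

import numpy as np

def disparity_scanline(left, right, max_disp, win=3, smooth=0.1):
    """Toy 1D stereo: brute-force search (a) over a SAD cost (b) plus a
    simple penalty on jumps between neighboring pixels as regularization (c)."""
    n = len(left)
    disp = np.zeros(n, dtype=int)
    prev = 0
    for x in range(n):
        lo, hi = max(0, x - win), min(n, x + win + 1)
        best_cost, best_d = np.inf, 0
        for d in range(max_disp + 1):
            if lo - d < 0:
                continue                            # window falls off the image
            cost = np.abs(left[lo:hi] - right[lo - d:hi - d]).sum()
            cost += smooth * abs(d - prev)          # crude smoothness prior
            if cost < best_cost:
                best_cost, best_d = cost, d
        disp[x] = prev = best_d
    return disp

# Synthetic scanline shifted by 4 pixels: the search should recover ~4.
rng = np.random.default_rng(1)
left  = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.01 * rng.standard_normal(200)
right = np.roll(left, -4)                           # right view = left shifted by 4 px
print(disparity_scanline(left, right, max_disp=8)[20:40])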

      What was discussed is the geometry. That there should be a disparity between the far sensors is a question of geometry supposing that you already know the depth, which can be estimated from the size of the model's head.

      On the second point of what you said, I would like to remind you that the exact statement was:

      "The fact that it is 16 instead of 256 shouldn't be the issue. If you are concerned with getting accurate depth,..."

      Your response is "it does matter if you want to leverage the same algorithms as Stanford's...". It sounds like you are implying that Stanford's is the only method to get accurate depth? If yes, I would recommend you do a literature survey of work outside of Stanford. If no, then there is no point of disagreement. It's a matter of logic.

      (d) "P doesn't matter if you need to do Q [because R can do Q very well]"
      (e) "P does matter if you want to use S to do Q"

      From (d) and (e), you can see why your statement about Stanford is irrelevant.

      That aside, I am assuming that you got color histogram entropy from Vaish's publications. I just want to remind you that color histogram entropy is one metric for correspondence (see b above), but it is by no means the only metric for correspondence. Pelican does not need to use this metric.

      The dispute above with respect to fake or real images is specifically about the image formation process, not the analysis, calibration, depth extraction, or fusion process. If the laws of nature hold, it must be true that the far images exhibit a disparity of about 20 pixels. Their image (in Figure 2) does not show any disparity (otherwise known as parallax), so either they have figured out a way to violate the laws of nature, or they're faking (photoshopping) the image. That's the whole purpose of all the math and references.

      If it is in fact the case that they did not fake the image, I would suggest Pelican change its business to building perpetual motion machines. That may be a more lucrative market than cameras.

    2. There is a possibility that they have cropped all the images to show the common image space, as the uncropped images would take up significantly more space and the array they show is easier to understand. Also, analysis of the individual pictures (take the top left and bottom right) does show differences (not normal JPEG compression differences) that could be caused by a perspective change. So maybe not so fake, but rather massaged into something presentable.

