Saturday, April 18, 2020

Help us preserve the original Furby!



Update 2020-04-27

Given that submissions have tapered off, we've taken down the server. Thanks to all who have contributed so far! We'll let everyone know when more information is available.

Update 2020-04-26

Thanks to all of those who have contributed! We've been running for about a week, and results have leveled off at around 80% complete (currently around 84%). We now have some statistics and a few proposals.

Some basic statistics:
  • Pages: 297
  • Lines: 19510
  • Page submissions: 744
  • Line changes (roughly): 10297
  • Line changes where at least 2 of 3 submissions agree: 9191
  • Line changes without 2/3 agreement: 1106
All 297 pages have at least two submissions, and about 50% have three. Combining these results flagged about 50% of lines for adjustment. For about 89% of those suggested changes, the submissions agree. Based on existing data, completing all three sets of challenges would reduce the number of lines still requiring manual review to about 600.
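As a rough illustration of the 2/3 agreement rule behind these numbers, the sketch below votes line by line across the submissions for one page. It is a minimal sketch and not the project's actual code: the function name `merge_page`, the assumption that all submissions are already aligned to the same number of lines, and the sample assembly-style lines are all hypothetical.

```python
from collections import Counter


def merge_page(submissions, min_agree=2):
    """Vote line by line across aligned submissions for one page.

    Returns (merged_lines, conflicts), where conflicts lists the indexes
    of lines on which fewer than `min_agree` submissions agreed.
    Assumes (hypothetically) that every submission has the same line count.
    """
    merged, conflicts = [], []
    for idx, candidates in enumerate(zip(*submissions)):
        # Ignore trailing whitespace differences when comparing lines.
        votes = Counter(line.rstrip() for line in candidates)
        text, count = votes.most_common(1)[0]
        if count >= min_agree:
            merged.append(text)
        else:
            # No agreement: keep the first submission's line and flag it
            # for manual review.
            merged.append(candidates[0].rstrip())
            conflicts.append(idx)
    return merged, conflicts


# Made-up sample data: three submissions for a three-line page.
subs = [
    ["LDA #00", "STA Port_A", "JMP Main"],
    ["LDA #00", "STA Port_A", "JMP Maim"],
    ["LDA #OO", "STA Port_A", "JNP Main"],
]
merged, conflicts = merge_page(subs)
print(merged)
print(conflicts)
```

In this made-up example the first two lines reach agreement, while the third has three different readings and would be left for manual review.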

A few more advanced heuristics were also tried (ex: partial line matching, weighting user results based on how much we trust them), but ultimately we weren't convinced that any of these is the right approach.
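For readers curious what trust weighting could look like, here is a minimal sketch. It is only one possible reading of the idea, not the heuristic that was actually tried: the function `weighted_vote`, the user names, the trust weights, and the 50% acceptance threshold are all hypothetical.

```python
from collections import defaultdict


def weighted_vote(candidates, trust, threshold=0.5):
    """Pick the line text with the highest total trust weight.

    `candidates` maps a user name to that user's text for one line.
    `trust` maps a user name to a weight (unknown users default to 1.0).
    Returns (text, accepted); accepted is True only if the winning text
    holds more than `threshold` of the total weight.
    """
    totals = defaultdict(float)
    for user, text in candidates.items():
        totals[text.rstrip()] += trust.get(user, 1.0)
    winner = max(totals, key=totals.get)
    share = totals[winner] / sum(totals.values())
    return winner, share > threshold


# Made-up example: three users disagree on one line.
line = {"alice": "JMP Main", "bob": "JMP Maim", "carol": "JNP Main"}
print(weighted_vote(line, {"alice": 3.0, "bob": 1.0, "carol": 1.0}))
print(weighted_vote(line, {}))
```

The second call shows the weakness of this approach: with equal weights nobody wins, and with unequal weights a single trusted user can override everyone else, which may be part of why it is hard to be confident in such a scheme.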

So, where does this leave things? Two main options are being considered:
  • Push the annotated source to github or gitlab as is. We estimate that it would take someone about 6-12 hours to fix, which is not intractable. Default would have been the furby-source repository on github, but they have stopped responding
  • Restart the crowdsource server using the best result with annotated conflicts. Users would need to delete the extra lines and submit. However, we suspect users need a break, so at a minimum we would probably hold off a few months to regain momentum
Note we suspect additional fixes will be required upon eventual manual review, whichever path is taken. Generally the first option seems like the best. A few dedicated users could knock this out fairly quickly without too much coordination. If we get a few volunteers (or one very dedicated volunteer), we'll figure out where to push this and move the project forward. Ideally one of these people would also be interested in coordinating other community contributions.

So we're asking if people are interested in the first option, and we'll likely default to the second if we don't get traction. Please let us know here in the comments or on Twitter!

Update 2020-04-20


Higher quality .pngs have been swapped in after reports that compression is swapping letters (!). Special thanks to Video Game Preservation Collective for the above image! The old set was from the text annotated version while the new set is believed to be the original scan. Unfortunately these images are about 5x larger, but should improve accuracy.

We've also done a very crude analysis of the existing submits and used them to make a quick guess at better default text to present. This affects about 85% of entries. So going forward you'll typically get higher quality defaults. But please still be attentive and look for errors!

There have also been a few backend tweaks, notably favoring showing pages with fewer submissions. However these generally should not be visible externally.
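A minimal sketch of what "favoring pages with fewer submissions" could mean in practice follows. The function `pick_page`, its arguments, and the random tie-breaking are assumptions for illustration; the actual backend logic is not described in this post.

```python
import random


def pick_page(submit_counts, already_done=(), rng=random):
    """Choose the next page to show, favoring the fewest submissions.

    `submit_counts` maps page number -> submissions received so far.
    Pages in `already_done` (ones this user has submitted) are skipped.
    Returns None when no eligible page remains.
    """
    eligible = {p: n for p, n in submit_counts.items() if p not in already_done}
    if not eligible:
        return None
    fewest = min(eligible.values())
    # Break ties randomly so users are not all handed the same page.
    return rng.choice(sorted(p for p, n in eligible.items() if n == fewest))


counts = {1: 3, 2: 1, 3: 2}
print(pick_page(counts))
print(pick_page(counts, already_done={2}))
```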

Update 2020-04-19

We're up to 197 submissions! Thanks to all of you that have posted so far! We need to meet a minimum of 297, so we're making great progress. Our goal is to get 3 submissions to help correct errors, for a total of 891.

We will briefly bring down the site for maintenance at 2020-04-21 6:00 AM. We will use this window to improve the default text based on submissions so far. This should make challenges much easier as mostly you'll only need to do small corrections instead of large edits. We will also fix the overall progress indicator, which currently says 1485 required, but it should be 891.
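The progress numbers above are simple arithmetic, sketched here for clarity. The variable names are illustrative. Note that 1485 happens to equal 297 × 5, which would be consistent with the indicator assuming five submissions per page, although the cause of the wrong figure isn't stated here.

```python
PAGES = 297
SUBMISSIONS_PER_PAGE = 3  # goal: three independent transcriptions per page

required = PAGES * SUBMISSIONS_PER_PAGE
print(required)  # 891

received = 197
print(f"{received / required:.1%} of the goal")

# The old indicator value divides evenly by the page count.
print(1485 // PAGES, 1485 % PAGES)
```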

Once again, thanks for your help and please let us know if you have any feedback!

Micro update: the progress indicator fix has been pushed out (it was not necessary to bring the server down)

Background

The Furby is an iconic talking toy from the late 90s. A couple of years ago scans of the original Furby source code were acquired. Unfortunately the scans are noisy and automatic image to text conversion is difficult. So we're asking the community to help preserve game history by proofreading computer generated transcripts. Generating a proper copy of the Furby source code will be enormously valuable to understanding how it works!

Project TLDR:
  • Complete using your web browser
  • You need a large screen (laptop or desktop)
  • Scanned image at left, noisy text interpretation at right
  • Fix errors in the image to text translation and submit
  • Remove headers and footers (ex: "Page 6", "A-121", "Diag7.asm" ) 
  • Unreadable: put best guess if possible, or random characters as last resort (will flag for review)

Although the crowdsourcing system wasn't a good fit for Great Swordsman, it spurred some conversations on what it could be used for. It has been revived and adapted to work on improving pdf image to text conversion.

Join the effort by signing up for an account! If you had an account on the previous TGP project, it likely is still available. Additional instructions are available after creating an account. If you have some time, please try a few images!

Finally, the person who gets the most pages accepted (ie with acceptable accuracy) will get early blog access for 3 months! Note however you must provide your e-mail address to qualify so that we can actually send it to you.

Sounds good? Sign up here! Instructions are available after logging in.

Note: due to various issues we are unable to split the pages into smaller tasks. So the images are relatively large and this is best completed on systems with a large screen such as a laptop or a desktop. So apologies if you only have mobile, but you may not be able to help with this specific project.

Special thanks to Andrew Gardner for writing the original tool and John McMaster for recent modifications!

FAQ

We'd also love if you have suggestions for improving the work flow. These are things already on our mind:


Q: What happened after the last crowd sourcing project? (Fujitsu DSPs / TGPs)

A: Post processing took a while, but it ultimately led to massive improvements on how well the community understands these games. However we've been doing a poor job at communicating those results and still need to write a post about it. See for example this MAME post which mentions recovering "...the Sega Model 1 coprocessor TGP programs for Star Wars Arcade and Wing War, making these games fully playable."


Q: Can you make the challenges smaller?

A: Not easily. The pages aren't well aligned, so we'd need to figure out both correct straightening and cropping.


Q: Can you align the text editor to the images better? Maybe rich text features like find and replace?

A: While the chip community can unlock the secrets of the micro universe, we can't code websites for beans. Really it's a miracle that the site is running at all. If you can help with improving text entry, please reach out! FYI it's written in Python/Django and could use some cleanup. If you haven't been scared off, more info is here



Q: What happens after it's captured?

A: First we'll post process to remove errors. After that we'll use the CPU manual to make a special 6502 assembler to create a binary. Ideally we'll also combine this with the Furby 70-800 ROM microscope images (sample above) at some point.


Q: Where did the source come from?

A: Not sure exactly, but some information is available at the Internet Archive


Q: Can I edit my result after submission?

A: It is not possible to modify it at this time. But don't worry, most of the time we can detect errors by combining a few results.


Q: Can you reset my password?

A: Yes, but it requires manual admin intervention. We suggest creating a new account if you aren't really tied to your old one


Q: Isn't that Furby image for the Furby 2012, not the original Furby?

A: Maybe... Actually we have a 70-800 image now

Prologue

More questions? Type them below, or reach out to us on Twitter. Thanks again for your help!

13 comments:

  1. Unrelated, but that isn't a Boom, either - it's a 2012 model.

    1. Oh ha I totally forgot about the 2012, you are right! I'll update that

  2. > Q: Where did the source come from?
    The scan itself (and the effort to obtain it from the USPTO) is courtesy of Sean Riddle (seanriddle.com).
    I uploaded it to archive.org.

  3. Me and some other people over at VGPC came to the conclusion that the images that are used in the application contain lots of errors. Somehow the scans seem to be altered, in a 'bad' way. Characters are changed (creating seemingly type errors) or are missing (for example, I've been told there are missing semicolons, which are according to your own tutorial very important).

    Someone noticed that the source on archive.org is different from the source on https://seanriddle.com. It seems seanriddle's source has been OCR'ed and contains the same mistakes, while the scans on archive.org are raw and (seemingly) untouched.

    Since you came to the conclusion that the OCR isn't (fully) working, it would, in my opinion, be better to feed the untouched images to the humans doing the work instead of the altered images. This avoids a lot of unintentional typos that the OCR changed in the images.

    1. Indeed - I've already suggested that on Twitter.
      It's because seanriddle.com started redirecting to archive.org's default (OCR-optimized) version after a massive traffic spike back when this made news at Y Combinator.

  4. Currently (Tue Apr 21 18:19:28 UTC 2020), the server seems to be experiencing some sort of an issue, with the connection to cs.siliconpr0n.org:8000 being slow and seeming to have a high ratio of dropped packets.

    1. Hmm...I see those issues in the log ("Connection timed out"). Looks like something started around 21/Apr/2020 18:08:56 UTC. Currently I'm able to connect and I'll monitor the situation to see if I can figure out what's going on.

      There have been a lot of proxy scans and (irrelevant) proxy attacks, so my quick guess is that it's related to that. I'll dig a bit deeper

      Thanks for the heads up!

  5. Regarding the latest update - why can't both options 1 (storing the preliminary result in a repository) and 2 (using a setting similar to the existing one for corrections) be implemented?

    For example, the changes accepted into the repo could be later used to reduce the number of transcription conflicts when #2 is implemented.

    1. The second step might even end up becoming unnecessary.

    2. Yes, this is a possible solution and is being evaluated

  6. I have been looking at the Furby source for a while now (I started transcribing before I knew of this effort.) I have come to the conclusion that significant amounts of code and data are missing from the PTO scans, and that reading the ROM masks is going to be needed to generate a complete source file.

  7. is the cs dot sipr0n dot org project still around? bored and got the bug to contribute some more...
