2012-05-30 2 views

"Extra data: line 2 column 1" error when using pycurl with a gzip stream

Thanks for reading.

Background: I am trying to read a streaming API feed that returns data in JSON format, and then store that data in a pymongo collection. The streaming API requires the header "Accept-Encoding": "gzip".

What happens: The code fails on json.loads and outputs "Extra data: line 2 column 1 - line 4 column 1 (char 1891 - 5597)" (see the error log below).

This does not happen for every JSON object that is parsed; it happens at random.

My guess is that after every "x" well-formed JSON objects I run into a malformed one.

I have looked at "how to use pycurl if requested data is sometimes gzipped, sometimes not?" and "Encoding error while deserializing a json object from Google", but so far I have not managed to resolve this error.

Could someone please help me out here?

Error log: Note: The raw dump of the JSON object below is the output of repr(), which prints the raw representation of the string without resolving CRLF/LF.


'{"id":"tag:search.twitter.com,2005:207958320747782146","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:493653150","link":"http://www.twitter.com/Deathnews_7_24","displayName":"Death News 7/24","postedTime":"2012-02-16T01:30:12.000Z","image":"http://a0.twimg.com/profile_images/1834408513/deathnewstwittersquare_normal.jpg","summary":"Crashes, Murders, Suicides, Accidents, Crime and Naturals Death News From All Around World","links":[{"href":"http://www.facebook.com/DeathNews724","rel":"me"}],"friendsCount":56,"followersCount":14,"listedCount":1,"statusesCount":1029,"twitterTimeZone":null,"utcOffset":null,"preferredUsername":"Deathnews_7_24","languages":["tr"]},"verb":"post","postedTime":"2012-05-30T22:15:02.000Z","generator":{"displayName":"web","link":"http://twitter.com"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/Deathnews_7_24/statuses/207958320747782146","body":"Kathi Kamen Goldmark, Writers\xe2\x80\x99 Catalyst, Dies at 63 http://t.co/WBsNlNtA","object":{"objectType":"note","id":"object:search.twitter.com,2005:207958320747782146","summary":"Kathi Kamen Goldmark, Writers\xe2\x80\x99 Catalyst, Dies at 63 http://t.co/WBsNlNtA","link":"http://twitter.com/Deathnews_7_24/statuses/207958320747782146","postedTime":"2012-05-30T22:15:02.000Z"},"twitter_entities":{"urls":[{"display_url":"nytimes.com/2012/05/30/boo\xe2\x80\xa6","indices":[52,72],"expanded_url":"http://www.nytimes.com/2012/05/30/books/kathi-kamen-goldmark-writers-catalyst-dies-at-63.html","url":"http://t.co/WBsNlNtA"}],"hashtags":[],"user_mentions":[]},"gnip":{"language":{"value":"en"},"matching_rules":[{"value":"url_contains: 
nytimes.com","tag":null}],"klout_score":11,"urls":[{"url":"http://t.co/WBsNlNtA","expanded_url":"http://www.nytimes.com/2012/05/30/books/kathi-kamen-goldmark-writers-catalyst-dies-at-63.html?_r=1"}]}}\r\n{"id":"tag:search.twitter.com,2005:03638785","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:178760897","link":"http://www.twitter.com/Mobanu","displayName":"Donald Ochs","postedTime":"2010-08-15T16:33:56.000Z","image":"http://a0.twimg.com/profile_images/1493224811/small_mobany_Logo_normal.jpg","summary":"","links":[{"href":"http://www.mobanuweightloss.com","rel":"me"}],"friendsCount":10272,"followersCount":9698,"listedCount":30,"statusesCount":725,"twitterTimeZone":"Mountain Time (US & Canada)","utcOffset":"-25200","preferredUsername":"Mobanu","languages":["en"],"location":{"objectType":"place","displayName":"Crested Butte, Colorado"}},"verb":"post","postedTime":"2012-05-30T22:15:02.000Z","generator":{"displayName":"twitterfeed","link":"http://twitterfeed.com"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/Mobanu/statuses/03638785","body":"Mobanu: Can Exercise Be Bad for You?: Researchers have found evidence that some people who exercise do worse on ... http://t.co/mTsQlNQO","object":{"objectType":"note","id":"object:search.twitter.com,2005:03638785","summary":"Mobanu: Can Exercise Be Bad for You?: Researchers have found evidence that some people who exercise do worse on ... 
http://t.co/mTsQlNQO","link":"http://twitter.com/Mobanu/statuses/03638785","postedTime":"2012-05-30T22:15:02.000Z"},"twitter_entities":{"urls":[{"display_url":"nyti.ms/KUmmMa","indices":[116,136],"expanded_url":"http://nyti.ms/KUmmMa","url":"http://t.co/mTsQlNQO"}],"hashtags":[],"user_mentions":[]},"gnip":{"language":{"value":"en"},"matching_rules":[{"value":"url_contains: nytimes.com","tag":null}],"klout_score":12,"urls":[{"url":"http://t.co/mTsQlNQO","expanded_url":"http://well.blogs.nytimes.com/2012/05/30/can-exercise-be-bad-for-you/?utm_medium=twitter&utm_source=twitterfeed"}]}}\r\n' 
json exception: Extra data: line 2 column 1 - line 4 column 1 (char 1891 - 5597) 

Header output:


HTTP/1.1 200 OK 

Content-Type: application/json; charset=UTF-8 

Vary: Accept-Encoding 

Date: Wed, 30 May 2012 22:14:48 UTC 

Connection: close 

Transfer-Encoding: chunked 

Content-Encoding: gzip 

get_stream.py:


#!/usr/bin/env python
import sys
import pycurl
import json
import pymongo

STREAM_URL = "https://stream.test.com:443/accounts/publishers/twitter/streams/track/Dev.json"
AUTH = "userid:passwd"

DB_HOST = "127.0.0.1"
DB_NAME = "stream_test"

class StreamReader:
    def __init__(self):
        try:
            self.count = 0
            self.buff = ""
            self.mongo = pymongo.Connection(DB_HOST)
            self.db = self.mongo[DB_NAME]
            self.raw_tweets = self.db["raw_tweets_gnip"]
            self.conn = pycurl.Curl()
            self.conn.setopt(pycurl.ENCODING, 'gzip')
            self.conn.setopt(pycurl.URL, STREAM_URL)
            self.conn.setopt(pycurl.USERPWD, AUTH)
            self.conn.setopt(pycurl.WRITEFUNCTION, self.on_receive)
            self.conn.setopt(pycurl.HEADERFUNCTION, self.header_rcvd)
            while True:
                self.conn.perform()
        except Exception as ex:
            print "error ocurred : %s" % str(ex)

    def header_rcvd(self, header_data):
        print header_data

    def on_receive(self, data):
        temp_data = data
        self.buff += data
        if data.endswith("\r\n") and self.buff.strip():
            try:
                tweet = json.loads(self.buff, encoding = 'UTF-8')
                self.buff = ""
                if tweet:
                    try:
                        self.raw_tweets.insert(tweet)
                    except Exception as insert_ex:
                        print "Error inserting tweet: %s" % str(insert_ex)
                    self.count += 1

                if self.count % 10 == 0:
                    print "inserted "+str(self.count)+" tweets"
            except Exception as json_ex:
                print "json exception: %s" % str(json_ex)
                print repr(temp_data)


stream = StreamReader()

Fixed code:


def on_receive(self, data):
    self.buff += data
    if data.endswith("\r\n") and self.buff.strip():
        # NEW: Split the buff at \r\n to get a list of JSON objects and iterate over them
        json_obj = self.buff.split("\r\n")
        for obj in json_obj:
            if len(obj.strip()) > 0:
                try:
                    tweet = json.loads(obj, encoding = 'UTF-8')
                except Exception as json_ex:
                    print "JSON Exception occurred: %s" % str(json_ex)
                    continue

Thanks!!! I owe you a drink, you just relieved my stress! – vgoklani

Answer


Try pasting your output string into jsbeautifier.

You will see that it is actually two JSON objects, not one, and json.loads cannot handle that.

They are separated by \r\n, so it should be easy to split them.
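To see this without the live stream, here is a minimal sketch that reproduces the error on a tiny made-up payload and then splits it. The payload and the helper name try_parse are only illustrative, and the snippet is written for Python 3 (the question's code is Python 2).

```python
import json

# Two JSON documents separated by CRLF, shaped like the dump in the question
# (the contents are invented for illustration).
payload = '{"id": 1}\r\n{"id": 2}\r\n'


def try_parse(text):
    """Return (True, value) on success or (False, error message) on failure."""
    try:
        return True, json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        return False, str(exc)


# Parsing the whole payload fails: json.loads accepts exactly one document.
ok, message = try_parse(payload)

# Splitting on the delimiter first yields one document per piece.
records = [json.loads(line) for line in payload.split("\r\n") if line.strip()]
```

Here ok is False and message starts with "Extra data", while records holds both objects.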

The problem is that the data argument passed to on_receive does not necessarily end with \r\n whenever it contains a line break. As your dump shows, the \r\n can also sit somewhere in the middle of the string, so checking only the end of the data chunk is not enough.
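One way to handle that, as a rough sketch rather than a drop-in replacement: keep a buffer, split it on \r\n after every chunk, parse the complete pieces and carry the unfinished tail over to the next call. The class name LineBuffer and its feed method are made up for this example. It assumes Python 3, where pycurl hands the write callback bytes, and it assumes blank lines between records can simply be skipped.

```python
import json


class LineBuffer:
    """Collects raw chunks and yields the JSON records that are complete so far."""

    def __init__(self, delimiter=b"\r\n"):
        self.delimiter = delimiter
        self.buff = b""
        self.errors = []  # messages for lines that could not be parsed

    def feed(self, data):
        """Add one chunk; return the list of records completed by it."""
        self.buff += data
        # Everything before the last delimiter is complete; the final piece
        # (possibly empty) is an unfinished record and stays in the buffer.
        *complete, self.buff = self.buff.split(self.delimiter)
        records = []
        for line in complete:
            if not line.strip():
                continue  # skip blank lines between records
            try:
                records.append(json.loads(line.decode("utf-8")))
            except ValueError as exc:
                self.errors.append(str(exc))
        return records
```

In on_receive you would then call something like `for tweet in self.lines.feed(data): ...`, with self.lines being a LineBuffer created in `__init__`. Because the split runs on the accumulated buffer, it does not matter whether a chunk ends in the middle of a record or even between the \r and the \n.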


Thanks bud, that worked perfectly! I am adding the new logic under "Fixed code" for people who refer to this in the future.
