Another way of debugging would be to write another UDF that returns a string. Also, in that function, put something useful in the catch block, so you can filter those records from the DataFrame.

On 9 Sep 2016 03:41, "Daniel Lopes" <dan...@onematch.com.br> wrote:
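[A minimal sketch of such a diagnostic string-returning UDF, assuming the transacoes DataFrame and date format that appear later in this thread; parse_date_debug, convert_date_debug and the PARSE_ERROR tag are illustrative names, not from the original messages:

--------------------------
import locale
from datetime import datetime

from pyspark.sql import functions as funcspk
from pyspark.sql.types import StringType

def parse_date_debug(argument, format_date='%b %d %Y %H:%M'):
    """Return the parsed date as an ISO string, or a tagged error message."""
    try:
        locale.setlocale(locale.LC_TIME, 'pt_BR.utf8')
        return datetime.strptime(argument, format_date).isoformat()
    except Exception as e:
        # Instead of silently returning None, keep the failure reason and
        # the offending raw value so the bad rows can be filtered out.
        return 'PARSE_ERROR: %s (raw=%r)' % (e, argument)

convert_date_debug = funcspk.udf(parse_date_debug, StringType())

debug_df = transacoes.withColumn('tr_Vencimento_debug',
                                 convert_date_debug(transacoes.tr_Vencimento))
bad = debug_df.filter(debug_df.tr_Vencimento_debug.startswith('PARSE_ERROR'))
bad.show(20, False)
--------------------------
]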
> Thanks Mike,
>
> A good way to debug! That was it!
>
> Best,
>
> *Daniel Lopes*
> Chief Data and Analytics Officer | OneMatch
> c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes
> www.onematch.com.br
>
> On Thu, Sep 8, 2016 at 2:26 PM, Mike Metzger <m...@flexiblecreations.com> wrote:
>
>> My guess is there's some row that does not match up with the expected
>> data. While slower, I've found RDDs to be easier to troubleshoot this
>> kind of thing until you sort out exactly what's happening.
>>
>> Something like:
>>
>> raw_data = sc.textFile("<path to text file(s)>")
>> rowcounts = raw_data.map(lambda x: (len(x.split(",")), 1)) \
>>                     .reduceByKey(lambda x, y: x + y)
>> rowcounts.take(5)
>>
>> badrows = raw_data.filter(lambda x: len(x.split(",")) != <expected number of columns>)
>> if badrows.count() > 0:
>>     badrows.saveAsTextFile("<path to malformed.csv>")
>>
>> You should be able to tell if there are any rows with column counts that
>> don't match up (the thing that usually bites me with CSV conversions).
>> Assuming these all match what you want, I'd try mapping the unparsed date
>> column out to separate fields and checking whether a year field doesn't
>> match the expected values.
>>
>> Thanks
>>
>> Mike
>>
>> On Thu, Sep 8, 2016 at 8:15 AM, Daniel Lopes <dan...@onematch.com.br> wrote:
>>
>>> Thanks,
>>>
>>> I *tested* the function offline and it works.
>>> I also tested with a SELECT * after converting the data and the new data
>>> looks good, *but* if I *register it as a temp table* to *join another
>>> table*, it still shows *the same error*:
>>>
>>> ValueError: year out of range
>>>
>>> Best,
>>>
>>> *Daniel Lopes*
>>>
>>> On Thu, Sep 8, 2016 at 9:43 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>>>
>>>> Daniel,
>>>> Test parse_date offline to make sure it returns what you expect.
>>>> If it does, create a df with 1 row only in the Spark shell and run your
>>>> UDF on it; you should be able to see the issue.
>>>> If not, send me a reduced CSV file at my email and I'll give it a try
>>>> this evening... hopefully someone else will be able to assist in the
>>>> meantime.
>>>> You don't need to run a full Spark app to debug the issue.
>>>> Your problem is either in parse_date or in what gets passed to the UDF.
>>>> Hth
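[To make Marco's one-row test concrete, a minimal sketch, assuming the Spark 1.6 sqlContext from the traceback below and the convert_date UDF defined in the next message; the literal date string is taken from the sample data further down:

--------------------------
from pyspark.sql import Row

# Reproduce the problem in isolation: one raw value, one UDF call.
one = sqlContext.createDataFrame([Row(tr_Vencimento='Abr 20 2015 12:00')])
one.select(convert_date(one.tr_Vencimento)).show()
--------------------------
]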
>>>> On 8 Sep 2016 1:31 pm, "Daniel Lopes" <dan...@onematch.com.br> wrote:
>>>>
>>>>> Thanks Marco for your response.
>>>>>
>>>>> The field came encoded by SQL Server in the pt_BR locale.
>>>>>
>>>>> The code that I am formatting with is:
>>>>>
>>>>> --------------------------
>>>>> import locale
>>>>> from datetime import datetime
>>>>>
>>>>> from pyspark.sql import functions as funcspk
>>>>> from pyspark.sql.types import TimestampType
>>>>>
>>>>> def parse_date(argument, format_date='%Y-%m-%d %H:%M:%S'):
>>>>>     try:
>>>>>         locale.setlocale(locale.LC_TIME, 'pt_BR.utf8')
>>>>>         return datetime.strptime(argument, format_date)
>>>>>     except:
>>>>>         return None
>>>>>
>>>>> convert_date = funcspk.udf(lambda x: parse_date(x, '%b %d %Y %H:%M'),
>>>>>                            TimestampType())
>>>>>
>>>>> transacoes = transacoes.withColumn('tr_Vencimento',
>>>>>                                    convert_date(transacoes.tr_Vencimento))
>>>>> --------------------------
>>>>>
>>>>> The sample is (showing only the populated columns; tr_TipoDocumento,
>>>>> tr_DataRecebimento, tr_TaxaMora, tr_DescontoMaximo, tr_DescontoMaximoCorr,
>>>>> tr_ValorAtualizado, tr_ValorDesconto, tr_ValorJuros, tr_ValorMulta,
>>>>> tr_DataDevolucaoCheque, tr_Banco, tr_Praca, tr_DescricaoAlinea,
>>>>> tr_Enquadramento, tr_Linha, tr_Arquivo, tr_DataImportacao and tr_Agencia
>>>>> are null in every row, tr_ComGarantia is 0 and
>>>>> tr_ValorCorrigidoContratante is 254.35):
>>>>>
>>>>> -------------------------
>>>>> +-----------------+-----------------+--------+--------------------+
>>>>> |tr_NumeroContrato|    tr_Vencimento|tr_Valor|  tr_DataNotificacao|
>>>>> +-----------------+-----------------+--------+--------------------+
>>>>> | 0000992600153001|Jul 20 2015 12:00|  254.35|2015-07-20 12:00:...|
>>>>> | 0000992600153001|Abr 20 2015 12:00|  254.35|                null|
>>>>> | 0000992600153001|Nov 20 2015 12:00|  254.35|2015-11-20 12:00:...|
>>>>> | 0000992600153001|Dez 20 2015 12:00|  254.35|                null|
>>>>> | 0000992600153001|Fev 20 2016 12:00|  254.35|                null|
>>>>> | 0000992600153001|Fev 20 2015 12:00|  254.35|                null|
>>>>> | 0000992600153001|Jun 20 2015 12:00|  254.35|2015-06-20 12:00:...|
>>>>> | 0000992600153001|Ago 20 2015 12:00|  254.35|                null|
>>>>> | 0000992600153001|Jan 20 2016 12:00|  254.35|2016-01-20 12:00:...|
>>>>> | 0000992600153001|Jan 20 2015 12:00|  254.35|2015-01-20 12:00:...|
>>>>> | 0000992600153001|Set 20 2015 12:00|  254.35|                null|
>>>>> | 0000992600153001|Mai 20 2015 12:00|  254.35|                null|
>>>>> | 0000992600153001|Out 20 2015 12:00|  254.35|                null|
>>>>> | 0000992600153001|Mar 20 2015 12:00|  254.35|2015-03-20 12:00:...|
>>>>> +-----------------+-----------------+--------+--------------------+
>>>>> -------------------------
>>>>>
>>>>> *Daniel Lopes*
>>>>>
>>>>> On Thu, Sep 8, 2016 at 5:33 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>>>>>
>>>>>> Pls paste code and sample CSV.
>>>>>> I'm guessing it has to do with formatting time?
>>>>>> Kr
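[A note on the sample above: the pt_BR month abbreviations (Abr, Mai, Ago, Set, Out, Dez, Fev) only parse while locale.setlocale succeeds, and setlocale is process-wide and fails if pt_BR.utf8 is not installed on every executor. A minimal locale-independent sketch, assuming the format shown in the sample; PT_MONTHS and parse_date_pt are illustrative names, not from the thread:

--------------------------
from datetime import datetime

# pt_BR month abbreviations as they appear in the sample data.
PT_MONTHS = {'Jan': 1, 'Fev': 2, 'Mar': 3, 'Abr': 4, 'Mai': 5, 'Jun': 6,
             'Jul': 7, 'Ago': 8, 'Set': 9, 'Out': 10, 'Nov': 11, 'Dez': 12}

def parse_date_pt(argument):
    """Parse e.g. 'Abr 20 2015 12:00' without touching the process locale."""
    try:
        mon, day, year, hhmm = argument.split()
        hour, minute = hhmm.split(':')
        return datetime(int(year), PT_MONTHS[mon], int(day),
                        int(hour), int(minute))
    except (AttributeError, KeyError, ValueError):
        # None input, unknown month, or malformed field: drop the value.
        return None
--------------------------

Wrapped with funcspk.udf(parse_date_pt, TimestampType()), it can stand in for convert_date above.]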
>>>>>> On 8 Sep 2016 12:38 am, "Daniel Lopes" <dan...@onematch.com.br> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm *importing a few CSVs* with the spark-csv package.
>>>>>>> Whenever I run a select against each one, it looks OK,
>>>>>>> but when I join them with sqlContext.sql I get this error.
>>>>>>>
>>>>>>> All tables have timestamp fields;
>>>>>>> the joins are not on these dates.
>>>>>>>
>>>>>>> *Py4JJavaError: An error occurred while calling o643.showString.*
>>>>>>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>>>>>>> Task 54 in stage 92.0 failed 10 times, most recent failure: Lost task
>>>>>>> 54.9 in stage 92.0 (TID 6356, yp-spark-dal09-env5-0036):
>>>>>>> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>>>>>>>   File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
>>>>>>>     process()
>>>>>>>   File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
>>>>>>>     serializer.dump_stream(func(split_index, iterator), outfile)
>>>>>>>   File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
>>>>>>>     vs = list(itertools.islice(iterator, batch))
>>>>>>>   File "/usr/local/src/spark160master/spark/python/pyspark/sql/functions.py", line 1563, in <lambda>
>>>>>>>     func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
>>>>>>>   File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/sql/types.py", line 191, in toInternal
>>>>>>>     else time.mktime(dt.timetuple()))
>>>>>>> *ValueError: year out of range*
>>>>>>>
>>>>>>> Does anyone know this problem?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> *Daniel Lopes*
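[The traceback narrows this down: the failure is not inside parse_date but in TimestampType.toInternal (types.py line 191 above), where Spark 1.6 converts the datetime the UDF returned via time.mktime; on Python 2 that call raises "ValueError: year out of range" when the year is below what the platform's mktime supports, so some row is parsing to a datetime with an out-of-range year. A hedged sketch of a guard inside the UDF; parse_date_safe is an illustrative name and the 1900 cutoff is an assumption about the platform:

--------------------------
from datetime import datetime

def parse_date_safe(argument, format_date='%b %d %Y %H:%M'):
    """Like parse_date, but reject datetimes that toInternal cannot convert."""
    try:
        dt = datetime.strptime(argument, format_date)
    except (TypeError, ValueError):
        return None
    # time.mktime, used by TimestampType.toInternal in Spark 1.6, raises
    # "ValueError: year out of range" for years its C mktime cannot
    # represent (assumed here to be anything before 1900). Return None
    # instead of crashing the whole stage.
    if dt.year < 1900:
        return None
    return dt
--------------------------

Registering this with funcspk.udf(parse_date_safe, TimestampType()) turns the job-killing rows into nulls, which can then be counted and inspected with a filter on isNull().]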