Function to Detect Non Ascii Characters Needed
Hi, I need a routine that can quickly detect whether a string contains any non-ASCII characters resulting from field corruption. Below is a copy-paste of a string that is killing ODBC. I'd like to test it before handing it over to SPT. These strings can be up to 4,000 characters long and there are millions of records, so iterating character by character through the string and testing each one for non-ASCII would work, but it is highly impractical because it would take forever. I'm getting about 100 such corrupted fields for every 200,000 records. Any suggestions?
Thanks, Stanley

Actual data from a failed command.

I am not sure exactly what you mean by 'ASCII characters', since the definition of 'ASCII characters' covers 128 different characters, a fair number of which are not visible, like the BELL character, which is supposed to ring a bell. Please definitively define exactly what you mean so we can point you in the correct direction.

Definition: ASCII: A code that assigns the numbers 0 through 127 to the letters of the alphabet, the digits 0 through 9, punctuation marks, and certain other characters. For example, the capital letter A is coded as 65 (binary 1000001).
By standardizing the values used to represent written text, ASCII enables computers to exchange information. Basic, or standard, ASCII uses seven bits for each character code, giving it 2^7, or 128, unique symbols.
Various larger character sets, called extended ASCII, use eight bits for each character, yielding 128 additional codes numbered 128 to 255. That said, if there is an unwanted character that occurs EVERY time in the corrupted field but NEVER in the good fields, the simplest way I know would be to use something like this, where n is the ASCII code for the unwanted character:

CODE
IF CHR(n) $ field
   DO something
ENDIF

mmerlinn

Poor people do not hire employees. If you soak the rich, who are you going to work for?
'We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding.' - Eric Raymond

RE: Function to Detect Non Ascii Characters Needed (Programmer). While doing some tests, the hex editor is showing them as 00, but when I copy and paste into VFP's command window, they show up as question marks. I copied this from the file: 52011. and pasting it into the command window I get this: zz='52011.???????????'
? CHR(00) $ zz   (returns .F.)

Here is what it looks like in the hex editor:

35 32 30 31 31 2E 20 20 20 20 20 20 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

RE: Function to Detect Non Ascii Characters Needed (Programmer) 22 Sep 12 03:49.

CODE
* Build a string containing all the non-ASCII chars
lcLowChars = ''
FOR lnI = 0 TO 31
   lcLowChars = lcLowChars + CHR(lnI)
ENDFOR
lcLowChars = lcLowChars + CHR(127)

* Assume lcInput contains the original text
* Strip out the non-ASCII chars
lcStripped = CHRTRAN(lcInput, lcLowChars, '')
IF lcStripped <> lcInput
   * The original text does contain non-ASCII chars
ELSE
   * It doesn't
ENDIF

Hope this makes sense.

Mike __________________________________ Mike Lewis (Edinburgh, Scotland)

RE: Function to Detect Non Ascii Characters Needed (Programmer).
Indeed ASCII chars include both readable and non-readable chars; as Mike said, control characters.

CODE
* init phase
LOCAL lcUnwantedChars, lnCode
lcUnwantedChars = ''
FOR lnCode = 0 TO 31
   lcUnwantedChars = lcUnwantedChars + CHR(lnCode)
ENDFOR

* loop over data to insert
IF LEN(CHRTRAN(lcSQLString, lcUnwantedChars, '')) = LEN(lcSQLString)
   * OK, send SQL
   SQLEXEC(...)
ELSE
   * unwanted chars found
ENDIF
* endloop

That said, there are situations where such data is not corrupted, e.g. when it's binary data, or encoded/encrypted data. Any table, DBFs included, can contain such 'control characters', but then they're not really control characters, as the data is not ASCII anyway; they're bytes having a meaning. I can see this is not the case in your sample: a field you insert into called IMAGE_PATH should not contain such a binary data string, but a readable path string. Nevertheless you should watch out for what the field types are and what data in them is OK. If there are binary fields, a mechanism that creates literal INSERT INTO statement strings is not the way to go to transfer data to a backend. For bulk insert of data, go for CSV, XML, anything you can BULK INSERT from the SQL Server perspective; executing a massive list of insert statements has its limits with binary data and performance. It's the least attractive way to insert data.
I know, for example, that the very popular MySQLDump creates scripts of inserts as a backup of data. This is really quite an amateur solution, but it has its upside: it works for normal data and from the client side, without needing access to directories where you would otherwise have to put CSV or XML files to be read in more elegantly. Parameters are again a way to also bring binary data over ODBC, as these are transferred in parallel. The client-side ODBC driver will parse out parameters specified by ?param and forward the variable, property or field specified by that name in a separate stream to the server-side ODBC endpoint. This also has many advantages when transferring datetime values: you don't have to care about finding a format the remote database understands; ODBC will take in a FoxPro datetime value and it will arrive correctly as an SQL Server datetime.

RE: Function to Detect Non Ascii Characters Needed (Programmer).
Hi Olaf,

>>I can't of course say this is the code to look for, the string also might be composed without using textmerge, but you get the idea. It's not the final strings you should analyse, it's the source data, which should not contain unwanted codes.

It was composed with textmerge's TEXT-ENDTEXT. You do make a valid point about dealing with it at the field level, and at least I can empty the field and ODBC would be happy. Plus, that data is bad anyway. I'll create a script that iterates through all the fields in the VFP table, emptying bad fields and logging them.

>>If there are binary fields, a mechanism that creates literal INSERT INTO statement strings is not the way to go to transfer data to a backend.
How would you transfer binary data to the backend? With a ?param? Please explain.

Thanks for the ?param idea. I have used it before, but was not totally aware of what was going on in the background. Thanks for bringing that back up. I still need to make a VFP data sanitizer routine that replaces bad data fields with something appropriate that indicates bad data while keeping ODBC happy.

Thanks, Stanley

RE: Function to Detect Non Ascii Characters Needed (Programmer) 22 Sep 12 05:12.

>How would you transfer binary data to the backend?
With a ?param? Yes, with a ?param. What should I explain? You can store a binary value in a variable, depending on the source. You can also pass ?alias.field from a DBF having a binary field; ? is not limited to variables, you can also pass fields, properties, any name. You will just fail on passing an object, as in ?_screen or ?loObject; that won't be passed to SQL Server, but that limitation is acceptable, isn't it?

You can still use the CHRTRAN code to identify unwanted chars, and you can do that at the overall level or on the separate field values. And then you can sanitize those with empty values or default values; you might also pass NULL, either via param or as a string. Is there still an open question?
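For illustration, a minimal parameterised insert could look like this; lnHandle, crsSource and the table/field names here are placeholders, not taken from the thread:

CODE
* a sketch only - lnHandle is an assumed open connection handle,
* crsSource an assumed source cursor with matching fields
lcSQL = "INSERT INTO backup_table (image_path, sync_time) " + ;
        "VALUES (?crsSource.image_path, ?crsSource.sync_time)"
IF SQLEXEC(lnHandle, lcSQL) < 0
   AERROR(laErr)     && inspect the ODBC error details
   ? laErr(2)
ENDIF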
RE: Function to Detect Non Ascii Characters Needed (Programmer) 22 Sep 12 14:25.

CODE
35 32 30 31 31 2E 20 20 20 20 20 20 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

But you also say that CHR(00) $ zz returns .F. Are you sure about that? The hex editor shows the actual ASCII values, and the data in your example definitely contains ASCII 0.

If CHR(0) $ zz really returns .F., you should find out what all the 00s in the data actually contain. Try ASC(RIGHT(lcData, 1)). If it is 0, then your CHR(0) $ zz is all you need. If it isn't zero, find out what it is, and whether that particular value is always present in the non-ASCII characters in your data.
If it is, test for that value in place of CHR(0). There should be no performance issues with that, despite the large amount of data involved.

Mike __________________________________ Mike Lewis (Edinburgh, Scotland)

RE: Function to Detect Non Ascii Characters Needed (Programmer) 22 Sep 12 14:51.

Stanlyn, if I copy 52011. from your post, I also get question marks in VFP. But that's just because whatever editor you used to open the file containing 0 bytes represents them with a box character, which in itself is a char that is neither in the ANSI codepage nor CHR(0). Do ? CHR(0) in VFP and nothing prints on the screen, not even a box. So if you copy from an editor, you have at least the translation from the file to the editor, from the editor to the clipboard, and from the clipboard to the VFP editor. If you want to read the file, then go the straight way, not via some editor, neither Notepad, nor WordPad, nor any other. The hex editor shows you the truth for sure. Check via:

lcFile = FILETOSTR(GETFILE())
? CHR(0) $ lcFile

There really are 0 bytes in there.

RE: Function to Detect Non Ascii Characters Needed (Programmer). The answer I wrote over 20 minutes is lost. OK, recap in short: Stanlyn, I already answered most of this. In regard to EmEditor, this means what I also already said: it's one of those editors displaying non-printable chars with a box char, which in turn FoxPro can't display and turns into ?. FoxPro string VARIABLES can contain any char, including CHR(0), but not string literals, i.e. anything you put in source code between string delimiters.
A way to get CHR(0) inside a string variable is to use the expression CHR(0) in code. TEXTMERGE has the same problem as source code: you put together a string variable, but you could also see it as merging into a text file, which is then read into a string. And it's a source-code text file.

Just as you can't put CHR(0) into an editor as a single char (as opposed to the expression CHR(0)), TEXTMERGE can't either. Think of CHR(13) = carriage return, for that matter: if that is inside a field of the update table, you create code containing a line break within a string.
You threw in the term 'sanitize'. If you really know what it means, then you should know it's there to prevent such problems: not only is a single quote char inside data a problem, control chars like CHR(13) or CHR(0) or anything unprintable are also a problem you need to care about. ODBC blindly forwards binary chars too.

So you won't have the problem of the SQLEXEC failing, but it will not filter unwanted data away. My main point is the CHRTRAN, for the third time. The parameterisation of SQL is on top of that, to optimise performance and to avoid the technical problems you have with text merging.
ODBC passes a string as is via a parameter, and the SQL command itself will contain a placeholder parameter name. This is translated by ODBC mechanisms beyond the scope of this discussion; read books about that. ODBC will also translate to ODBC types, but as I already said it will even work with binary data, so it will not interfere by translating text codepages or such.

It will translate e.g. a FoxPro integer to an ODBC integer, but their value ranges are identical, so no harm, and the same goes for most other types. You'll see whether a datetime arrives 1:1 or is rounded to seconds instead of milliseconds. But the concerns about ODBC type translations are much smaller than the problems of textmerging partially binary or corrupted data into a SQL command string. VFP is not even standing in the way; it's SQL Server not accepting non-printable chars within a SQL command string. 'INSERT INTO table (stringfield) VALUES ('stringvalue')' is not supposed to have any binary data as stringvalue, and SQL Server prevents that. So if you want to bring in that data, pass it via parameters. But if your main point is about filtering that data out, I point you towards CHRTRAN for the third time now.

This whole parameterisation thing is good and helps prevent technical problems, exceptions and errors, but without filtering it will just transfer the gibberish into SQL Server. It's still recommended, to avoid SQL injection, no matter whether it occurs randomly or is planned. If you have a problem, that might be because of scope. You have code now that does a textmerge; if the variables you use there are not in scope when you do the SQLEXEC, the simple change to a parameterised SQL command string will not be sufficient. You will also need to either do the SQLEXEC in the same place or make the variables available as e.g. object properties.
RE: Function to Detect Non Ascii Characters Needed (Programmer) 23 Sep 12 04:50.

Stanley, you are making this so much harder than it actually is. We have already established that you have one or more instances of unwanted CHR(0) in your data, and it appears that all corrupted data has that character in it. Just go back to the place where you were going to loop through your field looking for bad characters. Then, instead of looping through the field, do as has been suggested twice before:

CODE
IF CHR(0) $ yourfield
   DO something
ENDIF

Unless there is something that you have not told us, this should solve your original problem. If for some reason it does not, then come back with up-to-date information to help us out.

mmerlinn

Poor people do not hire employees. If you soak the rich, who are you going to work for?

'We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding.' - Eric Raymond

RE: Function to Detect Non Ascii Characters Needed (Programmer) 23 Sep 12 05:38.
I'm with both of you that solving the filter problem is the main issue. Sorry that I distracted from it. But even if CHR(0) is in one or more samples of corrupted data, I would strongly advise checking for other unwanted chars, too. The check for CHR(0) might filter out 99% of the cases, but if you can define more unwanted chars, or alternatively the wanted and accepted chars, you can check for them with CHRTRAN(); see my first post here.

RE: Function to Detect Non Ascii Characters Needed (Programmer) 23 Sep 12 06:02.

To all, sorry about all the confusion, of which I am more confused now than when I first started. Let me try to explain.

This thread started out with a question about (1) quickly determining if a string has bad/unprintable characters in it. Then we started talking about (2) textmerge and parameters, then we turned to (3) all the translations that are possibly involved, and finally (4) filtering. The benefit for me is learning much more than what I originally asked for. If you see a better way of doing things, then please, DO go off on a tangent. I get the part about 'IF CHR(0) $ yourfield', but that would take forever to process; I mentioned this fact at the top of this thread. At this point we are thinking it's CHR(0) (has anyone confirmed that yet?), and what about all the other unprintable possibilities, as they only add to the processing time.
Early in this thread, I stated that the fastest method would be to deal with it once ODBC complains, then look through the offending record's fields using Mike's code. No one has seconded this motion yet. I don't think that the variable Mike creates contains all possible bad characters and/or corruption that may be found in the string, even assuming binary data has been kept out, as binary data would certainly contain a much larger list of garbage characters that ODBC would complain about. I can easily handle the testing of each field on either the original table or the intermediate table, as I know what data type each is supposed to contain.

>>TEXTMERGE has the same problem as source code: you put together a string variable, but you could also see it as merging into a text file, which is then read into a string. And it's a source-code text file.

>>The parameterisation of SQL is on top of that, to optimise performance and to avoid the technical problems you have with text merging. ODBC passes a string as is via a parameter, and the SQL command itself will contain a placeholder parameter name. This is translated by ODBC mechanisms beyond the scope of this discussion.

Olaf, I'm understanding you as saying that I should not use textmerge, and instead should use ?params some other way. I thought that I have to use textmerge as the assembler of the SQL command that will be passed through to ODBC and finally to SQL. If you are NOT using textmerge with your ?params to do it the way you have suggested, then please be specific on how to do this, as I don't have a clue.

Earlier, I asked, with no direct confirmation: in a textmerge SQL update statement, instead of:

[sync_time] = '<<wd>>', [is_deleted] = <<lnIsDeleted>>

I'd use:

[sync_time] = ?wd, [is_deleted] = ?lnIsDeleted

Yes or no? If no, please explain.

>>If you have a problem, that might be because of the scope.

I cannot imagine it could be scope, because it processes 199,900 out of 200,000 without issue on average (100 errors out of 200,000).
As a result of this thread, I'm leaning toward (1) creating a sanitize script that processes each VFP table looking for bad data and corruption, either nulling the field or otherwise indicating that bad data was there, as well as logging it; then (2) sending it to SQL via SQLEXEC, trapping errors associated with bad data and outputting the string to a file for further investigation.

Thanks, Stanley

RE: Function to Detect Non Ascii Characters Needed (Programmer) 24 Sep 12 03:53.

Stanley, this demonstrates the difficulties of using the same thread for several themes (albeit related).
Let's focus for a moment on the original question - and, in my opinion, the simplest solutions. You say:

>>I get the part about 'IF CHR(0) $ yourfield'. Top of this thread. At this point we are thinking it's chr(0) (has anyone confirmed that yet?).

We are waiting for you to confirm that. You have the data. The question is: can you rely on the fact that, where the data is corrupted, it will always contain a binary 0? If yes, then use the $ or AT() function.
It will be instantaneous. End of story. If no, then use the code I originally posted. You say the objection is that it will 'take forever'. I know how much data there is, but the processing time is likely to be very small compared to the time taken to retrieve the data in the first place. You can make a couple of improvements to my code to speed it up (a combined sketch follows below):

1. Re the loop that builds the lcLowChars string: instead of doing this each time you do the test, do it once only, at initialisation, and keep lcLowChars in scope after that. (You probably already realised this.)

2. Instead of testing lcStripped <> lcInput, test LEN(lcStripped) <> LEN(lcInput). Comparing two integers is clearly much faster than comparing two long strings.

That still leaves the CHRTRAN(). I realise that involves iterating a very long string, but that's done in low-level, highly optimised code. I haven't tried to time it, but I believe it would be extremely fast, even with the volumes that you are working with. Finally, you say: 'I don't think that the variable Mike creates contains all possible bad characters and/or corruption that may be found in the string'. It's up to you to determine what constitutes a 'bad character or corruption'. At the outset, we talked about non-ASCII characters. If there are other characters that you know always result from corruption, then just add those to the lcLowChars string.
Similarly, if you know there are any non-ASCII characters that will never occur in the corruption, remove those from the string. I'll leave it to others to discuss the ODBC and TEXTMERGE issues.
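Putting those two improvements together, a sketch (variable names as in my earlier post):

CODE
* once, at initialisation
lcLowChars = ''
FOR lnI = 0 TO 31
   lcLowChars = lcLowChars + CHR(lnI)
ENDFOR
lcLowChars = lcLowChars + CHR(127)

* per string to test: compare lengths, not the strings themselves
IF LEN(CHRTRAN(lcInput, lcLowChars, '')) < LEN(lcInput)
   * lcInput contains at least one non-ASCII char
ENDIF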
Whatever approach you finally take, I hope you manage to find a solution.

Mike __________________________________ Mike Lewis (Edinburgh, Scotland)

RE: Function to Detect Non Ascii Characters Needed (Programmer) 24 Sep 12 03:57.

Thanks, Stanlyn. That gets several points straight and gives this thread a better aim again. As a side note: if this code has been in use for a long time and you encountered the problematic chars just this time, the very simple explanation may be that a file upload failed and just needs to be repeated, to e.g. get an uncorrupted UPLOAD.dbf. Nobody actually asked whether your code is old and just having problems with current input files, or whether the process is new and yet to be established.
I think we should address the detection of bad data first, then straighten out the textmerge usage. So in this post I'll address the garbage problem. You still didn't get that my CHRTRAN code will care for CHR(0) and other chars in one go(!). Please reread my first post, especially its code section. You see I add 32 chars as unwanted chars into a variable lcUnwantedChars, from CHR(0) to CHR(31).

You may take out CHR(9), CHR(13) and CHR(10), as they are TAB, CR and LF: not printable, but surely allowed control characters. This may again cause a textmerge problem, but that topic again is postponed. So when I talked of filtering (4) binary data, that was not meant as a new step to take; it merely was another word for (1) quickly determining if a string has bad/unprintable characters in it, so topic 4 = topic 1.
Your points 2 and 3 - the textmerge problem and the translation of codepages - will be addressed later. The translation issue is not a problem once you read in your data in ways that do no translation; all the explanations about translations were only made to show you how you get from your original file containing CHR(0) to ?. As another exercise, do _CLIPTEXT = CHR(0) and then try to paste that in VFP. Nothing will get pasted, not a ?, nor a box char; the text cursor doesn't even move by a char. In contrast, do STRTOFILE(CHR(0), 'test.bin'), then lcFilecontent = FILETOSTR('test.bin') and ? CHR(0) $ lcFilecontent. OK, now for the filtering/detecting of unwanted chars: in short, the above code will help you.
It's not true that it 'would take forever to process', even if you do it on all single fields instead of the SQL command string after you put that together, as putting together the SQL command string also takes time. You can simply loop from 1 to FCOUNT() over FIELD(i) and check with CHRTRAN for unwanted chars if the field type is character; you don't even need to hardcode field names, as you can test the type with TYPE(FIELD(i)), so you can make this generic code. You can also invert the CHRTRAN logic and create a variable lcWantedChars containing all allowed chars, then check whether CHRTRAN(field, lcWantedChars, '') results in an empty string, because if there is something remaining, it wouldn't be a wanted char.
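As a sketch of that generic, inverted check (the whitelist shown is only an example; extend it to match your data):

CODE
* example whitelist - adjust to the chars your data legitimately uses
lcWantedChars = ' abcdefghijklmnopqrstuvwxyz' + ;
                'ABCDEFGHIJKLMNOPQRSTUVWXYZ' + ;
                '0123456789.,;:-_/\()' + CHR(9) + CHR(10) + CHR(13)
FOR lnI = 1 TO FCOUNT()
   IF TYPE(FIELD(lnI)) = 'C' AND ;
      LEN(CHRTRAN(EVALUATE(FIELD(lnI)), lcWantedChars, '')) > 0
      ? 'Unwanted chars in field ' + FIELD(lnI)
   ENDIF
ENDFOR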
Don't forget to include space and other seldom-used but valid chars. As soon as you have this and report back what you find, I will address the textmerge. As a first answer: yes, that is the syntax, simply ?fieldname or ?variablename. You had that right, which is why I didn't answer; I'm still wondering why you don't simply let FoxPro answer that question - just try it and you will see. The paragraph about variable scope was not about your current code, but about the parameterised insert I thought you had already changed to; I was just addressing problems you could have with it, as with parameterised queries the scope of the passed variables matters, whereas with your current textmerge it does not. With your current textmerge you just have the values in the resulting string; the variables could be lost afterwards and it wouldn't matter, as their values are embedded, copied into the SQL command string via textmerge.
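To make the contrast concrete, a sketch with assumed names (lnHandle, mytable, wd, lnIsDeleted):

CODE
* textmerged: the values are embedded into the literal command string
TEXT TO lcSQL TEXTMERGE NOSHOW
UPDATE mytable SET sync_time = '<<wd>>', is_deleted = <<lnIsDeleted>>
ENDTEXT
SQLEXEC(lnHandle, lcSQL)

* parameterised: ODBC forwards wd and lnIsDeleted separately,
* so both must still be in scope when SQLEXEC runs
lcSQL = "UPDATE mytable SET sync_time = ?wd, is_deleted = ?lnIsDeleted"
SQLEXEC(lnHandle, lcSQL)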
Enough, this is delayed. Please first address the filtering/detection of unwanted chars and come back if you have questions about the solution.

RE: Function to Detect Non Ascii Characters Needed (Programmer) 24 Sep 12 04:48.

My first line of attack would be the test for CHR(0), as I suggested in my last post. That should solve most if not all of the problem that Stanley originally presented to us. That would also get the pressure off so he can solve the real problem without constantly being sidetracked by fires to put out. If perchance the first line of attack did not solve the whole problem, it would reduce the size of the problem set and make it easier to locate other characters that could be causing problems.
After removing the pressure cooker, I would go looking for WHERE the corrupted data is coming from. Solving the cause of the problem would stop any future issues. Basically, I see this as two issues: remove as many distractions as possible, then solve the underlying problem.

mmerlinn

Poor people do not hire employees. If you soak the rich, who are you going to work for?

'We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding.' - Eric Raymond

RE: Function to Detect Non Ascii Characters Needed (Programmer).

Hi all,

>>if your code is old and just having problems with current input files or if the process is new and yet to be established.

The data has been accumulating since 1998 and is coming from 7 different locations, so how the corruption got there is anyone's guess.

>>You still didn't get that my CHRTRAN code will care for chr(0) and other chars in one go(!).
Yes Olaf, I did get that; sorry for not communicating it to you. I mentioned using Mike's suggestion, but I saw that your suggestion actually expands on his to include many chars. I was in doubt that this alone would catch all possible 'non-printable characters'. Olaf, I'll do your exercises and get back. Assuming all the errors are fixed, I'd still like to see you take on the textmerge vs ?params question I asked about. More later today, Stanley

RE: Function to Detect Non Ascii Characters Needed (Programmer) 24 Sep 12 11:08.

>The data has been accumulating since 1998 and is coming from 7 different locations, so how the corruption got there is anyone's guess.

Then why don't you simply ask for a re-upload of that day's/week's/month's data? Removing corrupted records is not really solving the problem, is it?
If the data got corrupted in an upload, removing the unreadable records still means missing lots of data. I would actually not care about the error then, and redo it with correct data. What you're doing overall here may add to the stability of the import process, but you'd also not keep the uncorrupted pixel rows of an image; you'd discard it as a whole and ask for resubmission.

RE: Function to Detect Non Ascii Characters Needed (Programmer) 24 Sep 12 12:30.

>>I mentioned using Mike's suggestion, but I saw that your suggestion actually expands on his to include many chars. I was in doubt that this alone will catch all possible 'non-printable characters'.
Both bits of code - Olaf's and mine - handle all non-printable characters, and any other characters that you want to include or exclude. Remember, non-printable characters are those in the range 0 - 31, and also 127. (Admittedly, that also includes line feeds, carriage returns, and tabs, which are probably legitimate characters, but those can easily be excluded.)

Mike __________________________________ Mike Lewis (Edinburgh, Scotland)

RE: Function to Detect Non Ascii Characters Needed (Programmer).

Hi Dan,

>>The only thing data-specific about that code is the name of the database. Substitute one of your own.

Well, that's the point of using Adbobjects(), which generates the table names, and Afields(), which generates the field names.
They iterate through the database, generating the table names along with their field names. Now that I have the table and field names, I need to test the value of each field to see if it contains CHR(0) or whatever. How do I test the data in those fields? That's the part I'm missing. I also know that I will need to test the field for character type, as in gaFields(x, 2). Thanks, Stanley

RE: Function to Detect Non Ascii Characters Needed (Programmer) 4 Oct 12 19:03.

CODE
If Chr(0) $ (gaFields(x, 1))

always returns .F.
That's because the CHR(0) is not contained in what gaFields(x, 1) returns, which is the name of the field, NOT the value contained inside the field.

How do I get to the data (values) inside the fields? I originally tried what I commented out, and saw that it was only returning the field names in the array and not the data. I've also tried several variations of the array elements in the debugger trying to find the data, and I don't see it anywhere. I've also looked at all the array commands, and the only one that looked like it might be of use is the ASCAN() command. My code above shows me attempting to use it, but I kept getting the data-type error on the CHR(0) part of the command. Here is the original code.

Hi Mike,

>>So, Stanley, are you saying that the original problem has finally been solved?

Not yet, but this will help with and/or eliminate the original problem. Right now I'm saying your last suggestion on getting at the data via the Afields() function is what I needed to complete a Sanitize.prg that will be run on the VFP tables looking for any non-printables. I have it working now using only CHR(0). Next, I'm adding the full unprintable range, CHR(0)-CHR(31), less 10, 13, and some others.
Thanks again Mike, Stanley. Here is what it looks like now.

CODE
Clear
Close Tables All
lnTables = Adbobjects(gaTables, 'TABLE')
For i = 1 To lnTables
   ? 'TABLE: ' + gaTables(i)
   Use (gaTables(i)) In 0
   Select (gaTables(i))
   lnFields = Afields(gaFields)
   For x = 1 To lnFields
      ? gaFields(x, 1)
      If gaFields(x, 2) = 'C'  && test field type
         Go Top
         Do While !Eof()
            If Chr(0) $ Eval(gaFields(x, 1))
               ? gaFields(x, 1)
               ? Eval(gaFields(x, 1))
               Replace (gaFields(x, 1)) With 'BAD'
            Endif
            Skip 1
         Enddo
      Endif
   Endfor
   Use In Select(Dbf())
   Wait
Endfor

RE: Function to Detect Non Ascii Characters Needed (Programmer) 5 Oct 12 05:28.

Stanley, glancing at your code, it seems there is a big inefficiency in it. If I've understood it right, you are looping through an array containing a list of the tables in the database. For each table, you are looping through the list of fields.
And for each field, you are looping through each record in the table. I think it would be better to reverse the nesting of the final two loops. In other words, loop through the records in the table. For each record, loop through the fields. That way, you only have to do one pass per table, rather than one pass per field per table. One other small point: do any of the tables contain memo fields? If so, then you need to look for 'M' as well as 'C' when testing the field type.
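In outline, something like this (a sketch only, reusing your gaFields array and 'BAD' marker):

CODE
* one pass per table: loop over records, then over fields
lnFields = Afields(gaFields)
Scan
   For x = 1 To lnFields
      If InList(gaFields(x, 2), 'C', 'M') And Chr(0) $ Eval(gaFields(x, 1))
         Replace (gaFields(x, 1)) With 'BAD'
      Endif
   Endfor
Endscan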
Mike __________________________________ Mike Lewis (Edinburgh, Scotland)

RE: Function to Detect Non Ascii Characters Needed (Programmer) 5 Oct 12 10:14.

As Mike pointed out, your code is performing weakly. Even the code I gave you early on (IF LEN(CHRTRAN(lcSQLString, lcUnwantedChars, '')) = LEN(lcSQLString)) would inspect the final command string you put together, and thereby also test all fields of a record before executing its insert. Even that would be faster than your current idea. Now, if you want to find 0 bytes in your whole database, another, more holistic approach would be much more efficient: reading DBFs via low-level file functions. You need to skip the bytes belonging to non-char fields, which is not straightforward, but also not impossible. A simple way to avoid evaluating those fields is to extract all char fields into a separate DBF; then you can scan the whole DBF file, except the header bytes, in one go, without reading in single field values. There is the HEADER() function, which will tell you how many bytes of a DBF file to skip to get to the first record; after that it's RECSIZE() bytes for each record. If records contain memos, the DBF will have 4-byte integers representing an offset into the FPT file associated with the DBF; you can care for those separately, but these 4-byte pointers need to be skipped, just as you need to skip integer or numeric fields, date and datetime fields and others, too.

We can also take approaches that will not tell you the exact corrupt records, but just flag a corrupt DBF in the sense of unwanted chars being present. So the question is your overall goal: would you only like to skip single corrupt records, or find overall corrupt DBF files? I already warned you that partially inserting incoming data is not a good strategy. If a file is sent in via FTP or mail, for example, and for some reason is corrupted in some block of the file, you will be better off having that whole file resent than partially inserting the data that came in correctly. If you have corruptions due to network failures, you will find bursts of lengthy blocks of 0 bytes rather than anything else. So, depending on that, you could also use a rough estimate algorithm, not searching for CHR(0) in single fields, but for REPLICATE(CHR(0), 32) $ FILETOSTR(dbffilename); that would be much, much faster at finding such larger bursts of 0 bytes, which would only seldom occur as correct data, e.g. in tables with many integer fields all being 0. You have the broken files; I would suggest you open one of them in a hex editor - you have at least the hexedit.prg coming with VFP to do that - and then look out for bursts of 0 bytes in the DBFs and see if that is a sign you find in them.
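A rough sketch of that burst check (table and file names assumed; the header is skipped because it legitimately contains 0 bytes):

CODE
Use mytable            && assumed table name
lnHeader = Header()
Use
lcDBF = Filetostr('mytable.dbf')
If Replicate(Chr(0), 32) $ Substr(lcDBF, lnHeader + 1)
   ? 'Possible corruption burst in mytable.dbf'
Endif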
Also, if this is all about finding out whether the transfer of a file succeeded, there are much easier mechanisms with checksums at the file level. You won't analyse the inner workings of a file to check for file transfer errors. You would add MD5 checksums in one additional file to send, and check them against MD5 checksums computed from the files directly.
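VFP has no built-in MD5, but SYS(2007) can compute a CRC as a stand-in for the same idea; a sketch (file name assumed, flag value per the VFP 9 help):

CODE
lcFile = 'upload.dbf'                        && assumed file name
lcCRC = SYS(2007, Filetostr(lcFile), 0, 1)   && 1 = CRC32 per VFP 9 help
* ship lcCRC alongside the file, recompute after transfer and compare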
Another very simple approach is to send zips, as zips can be created and extracted programmatically and would fail to extract if they were not transferred intact, which would also simply tell you when to do retransfers.

RE: Function to Detect Non Ascii Characters Needed (Programmer).

Hi all, for additional clarification:
I ran into this CHR(0) issue while developing a synchronizing tool that syncs the VFP data to SQL at the record level. Therefore, I'm only interested in the records that fail to sync, for whatever reason. I can then deal with them, and for now it's the CHR(0) and whatever else could be in a numeric field. In the problem records where CHR(0) was found in almost all char fields, the numeric fields have some sort of strange character; I'll test for this later. Also, in this group one char field is OK, and everything else has unprintables. At this point I'm not transferring them via FTP, HTTP or anything else, as I got complete copies of the databases from each location and I'm processing them locally for now (while developing and testing).

I do not know how this select few of records got the CHR(0) embedded in them. The code above that uses the Afields() function was written as a simple utility that can sanitize the VFP database of this CHR(0) stuff; it can be executed as needed and was not intended to be part of the app. I do expect to use some of the bits and pieces talked about here in the final app.

And of course the CHR(0) check must be expanded to include other issues related to bad data. Thanks for the 'M' data type tip, as I had not yet gotten to it. Now that I can interrogate the underlying data, I will add the other bad-data options. In testing the numerics I spoke of above, it turns out they were not numerics; they are logicals. The browse displays the data strangely, as a fat I, but when printing the value, it prints as .F. Also note that it's only the records whose char fields contain mostly CHR(0)s.
Here is a screenshot. Any suggestions? Stanley

RE: Function to Detect Non Ascii Characters Needed (Programmer).

Stanlyn, at this stage it's totally clear your copies are just not correct. Don't try to heal a dead horse. All the code you work on now can later be replaced by checking checksums and rejecting wrong copies, no matter whether they come in from manual file copies or FTP or HTTP.

Your file is obviously not copied correctly, or the original file already has these errors. It's unusable data.
To understand the issue with your file, look at it in a hex editor. In the uncorrupted fields, where FoxPro's browse displays .T. or perhaps .NULL. (if the field is nullable) or perhaps space (for blank logical fields), in the hex editor (DO Home()+'Tools/Hexedit/Hexedit.prg') you'll see T and F and space (null is denoted as a set bit in the last byte(s) of the record, but that's not really important here). And this means any char other than T, F or space causes the browse to display nonsense, while accessing the field prints .F., because the byte is not T. So VFP doesn't really care what is stored; anything that is not T displays as .F.

But those are just details. A blind person could see these records are corrupted. So simply redo copying these databases.
It will not help you to skip corrupt records or fields; if you sync from such a corrupted table, the data you insert is not complete, and therefore even the correct records would not help you. E.g., if the parent record of an order is corrupt, you can't insert the order-detail records, even if they are correct. Don't try to mend this problem if it's just because of a corrupt file copy.

You can come back to this if a further copy of the data shows the same corruptions again, meaning the source data is already corrupted. But then you will also need further code to save the intact data of records belonging to each other through database relations, and that again needs a more holistic approach. At the moment you're stuck at a detail that will probably never be a problem again, unless the source data is already corrupt.

>>I got complete copies of the databases from each location

The thing to do for the databases not importing 100% is to ask for complete copies again. Then see if the secondary copies show the same corruptions. It's likely you just had a problem while copying, or a corrupt transfer or zip file or whatever, because otherwise your external location would report problems when working in the application.
What do you think?

RE: Function to Detect Non Ascii Characters Needed (Programmer) 6 Oct 12 04:13.

Another thing to note: if you want to upload data of whole databases into SQL Server, there is an upsizing wizard in VFP. It doesn't help with your corruptions, but once you have made sure your copies of the local stores are intact, you can import all data without programming individual code. Data migration is a problem so common, it shouts for already having been solved. Just one hint: don't use the native upsizing wizard, use the improved one from VFPX.

Bye, Olaf.

RE: Function to Detect Non Ascii Characters Needed (Programmer) 6 Oct 12 05:53.

Then why do you write about such a detail? I was already writing about a much more general approach, and I want to stop Stanlyn from going down this route, which only leads into more detail problems he wouldn't have in the first place if he could just get intact copies of the databases and then also use standard data migration tools like the advanced VFPX upsizing wizard. We really all led him down a wrong track, because we were not seeing the larger problem. But once you see a thread getting longer than average, you should ask for the larger-scale problem, get back to a larger-scale overview, and stop fiddling with details. I address your mistake only because it would lead to follow-up discussions about why limiting the loop to go to 31 also didn't solve the problem.
I just want to nip this in the bud. It's not at all important if Stanlyn follows my advice to start again and go to the root of the problem instead of continuing from trunk to branch to perch to leaf.

RE: Function to Detect Non Ascii Characters Needed (Programmer) 6 Oct 12 11:22.

After having calmed down a bit :) The only thing I want to say, Stanlyn, is: if your database copies coming from the local stores are corrupt, then in the first place just repeat the copying of that data before writing any analysis code. It's understood your current database copies have errors, but that doesn't mean the original files already had the errors. Just copy the databases again and see if the corruption really is permanent; ideally look into the original DBF files remotely, e.g. via TeamViewer.

Do they already show irregularities in the same tables and records? If there is no problem locally, you also don't need to dive into code needed for corruption analysis. The ideal way to assure data has come over intact is to compute checksums via MD5, store them in e.g. Table.md5 per DBF or DBC, and zip them together with all files; then recheck these checksums after unzipping locally, and if you detect a file with a checksum mismatch, repeat the copy. If, and only if, it turns out the data of some local places is already corrupt, then you can address this with analysis code; but then you will also need to decide what to do about these local stores, e.g. replace hardware or at least drives, what data to save and what to take as a loss, and what related data could reproduce some info or not.
But for now you're doing the last step first.

RE: Function to Detect Non Ascii Characters Needed (Programmer).

All, this is what I know. I've gotten multiple copies of each location's data by actually visiting the sites and copying the entire database to a portable drive. Also, in the past, I've zipped up the databases and FTPed them to my site. No matter which way I do it, the corruption is there and it's consistent. I don't have a choice but to salvage these tables, as it's far better to mark .001 percent of the records bad, print them out, and deal with each one on an individual basis. The good thing is the tables that have corruption are not part of any parent/child relationships; they are stand-alone.
The approach I'm taking now will work with these non-parent/child tables. It involves doing all the work on the VFP side before sending it to SQL. I'm iterating through each record and field, replacing bad data with a literal that indicates what the bad data was. For example, if the field has a CHR(0) in it, I'm replacing that value with the literal 'chr(0)', which tells me a CHR(0) was found there originally, but now the field holds a valid string. Originally, when I was dealing only with CHR(0), I was marking the field with the literal 'BAD', which Olaf picked up on. Now that I'm dealing with the whole spectrum, I'm replacing the field's value with a literal version of what was wrong.
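In sketch form, assuming lcValue holds the current field's contents and reusing the gaFields array from the earlier code (the 0-31/127-and-up range is only one possible definition of 'bad'):

CODE
lcMarker = ''
For lnI = 1 To Len(lcValue)
   lnCode = Asc(Substr(lcValue, lnI, 1))
   If lnCode < 32 Or lnCode >= 127
      lcMarker = lcMarker + Iif(Empty(lcMarker), '', ',') + ;
                 'chr(' + Transform(lnCode) + ')'
   Endif
Endfor
If !Empty(lcMarker)
   Replace (gaFields(x, 1)) With lcMarker   && e.g. 'chr(0),chr(146)'
Endif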
The tables were originally created by me in 1998 using VFP3, so no Clipper here. Olaf, you are saying that the corrupted logical values will always print as .F., simply because they are not T, despite the corruption shown in the browse. At this point I have uncovered chr(0), chr(146), chr(151), chr(171), chr(163), chr(167), chr(147), chr(145) and others, and this is after some cleanup.

>>there is an upsizing wizard in VFP

Yes, I know. I tried it and had a lot of conversion issues, so I rolled my own, where I have full control. And as you said, it cannot deal with the issues at hand, while the solutions presented here can, when coupled with my upsizing code. And please do be detailed, as I've learned so much from the details. Thanks, Stanley

RE: Function to Detect Non Ascii Characters Needed.

>No matter which way I do it, the corruption is there and it's consistent

OK, then it's worth continuing to find a way to identify corrupt records. I think, though, that you should not do it by looping over records and fields; we should find a more general approach handling the whole DBF file. It takes too long to loop over all records and fields, and the logical field type shows that you may see wrong chars in the browse while EVAL() of the logical field still yields .F.
Anyway, a wrong byte stored in the DBF file at a logical field position is not detected this way.

>>Olaf, you are saying that the corrupted logical values will always print as .F., simply because they are not T

Yes, that's what I see when simply setting a disallowed char for the logical field in the hex editor. I don't know what you mean by 'uncovering' chr(145) etc. in a logical field. Via hex editor? (Then it would help to specify hex values instead.) A logical field is stored with a length of one byte, not just a bit, as logical .t./.f. would suggest. A byte only is a char in char-type fields; otherwise it's a portion of a file storing 8 bits with values from 0-255. So I wouldn't talk about chr(145), I would talk about a byte 0x91.
What you see in a browse window in the case of such an unexpected byte value is not what you read from the logical field, or what Eval('field') results in, and your code only cares about the latter now, which does not detect these faults. You really would need to read the bytes from the DBF file as the hex editor does, which causes no interpretation or evaluation of the bytes, but takes them as is. Then you'd not only visually see wrong data in the logical fields, but also detect wrong byte values. To illustrate what I want to say with this: I took the hex editor again and put hex 91 at the byte position of a logical field (named llogic) of a DBF. Now the browse window shows a backtick char (´).

? llogic prints .F. on the screen, and Eval('llogic') also evaluates to .F. I also downloaded DBFViewer 2000 and let it read the 'defect' DBF with 0x91 in the logical field. It also simply displays the backtick and reports no error in the field. So this is what I said earlier: FoxPro does not care about these wrong and unexpected byte values; the field is 'intact', as its field value is still the logical value .f., while the browse visually shows a defect. And that's not the only reason why looping over all data via the normal way of accessing DBF fields of records will not always work; you might later on have more serious defects that don't even let you open a DBF file. What I suggested, therefore, was to make a rough check via REPLICATE(CHR(0),16) $ FILETOSTR('your.dbf'): a burst of 16 zero bytes being in the DBF.
That's just a rough estimate, as 16 zero bytes most probably indicate an error, but can also be correct; e.g. in 4 integer fields all storing 0, you get 16 zero bytes, 4 from each integer field. I hope this clears up some of the points I made. You can do your own way of reading a DBF: if you have some metadata about your DBFs stored reliably in a safe place, then you can read in a DBF fully via lcDBF = FILETOSTR('your.dbf') and then separate this into the header and the records. All you need for that is HEADER() and RECSIZE() of an uncorrupted DBF. LEFT(lcDBF, HEADER()) is the header; the rest of the lcDBF string is the records. You can check header validity without writing your own validation code, as there is SET TABLEVALIDATE, and just a USE of the DBF tests its header, more or less thoroughly depending on the TABLEVALIDATE level you set.
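A sketch of that split, assuming the header itself is intact and the table can still be opened (table name assumed):

CODE
Use mytable              && assumed table name
lnHeader  = Header()
lnRecSize = Recsize()
lnCount   = Reccount()
Use
lcDBF = Filetostr('mytable.dbf')
For lnRec = 1 To lnCount
   lcRecord = Substr(lcDBF, lnHeader + (lnRec - 1) * lnRecSize + 1, lnRecSize)
   If Chr(0) $ lcRecord
      ? 'Suspect bytes in record', lnRec
   Endif
Endfor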
You can then check the records as substrings of the rest of lcDBF, as the sketch above does. This is the approach I would take, so as not to depend on what VFP reads, e.g. from logical fields. You can really read the bytes stored in the DBF file and check them for validity. Commercial tools like FoxFix do it that way, and in the end I suggest you don't roll your own here. After the data has been amended, I'm quite confident you would be able to use the SQL Server upsizing wizard, at least as a supporting tool, and/or take code from it for your own import tool. Don't reinvent the wheel here, even more so if you need to learn much stuff before you can reliably say which conditions point to corrupt data.

You would need to know the low-level byte composition of all the different FoxPro types if going this route, and as I explained, it's the only reliable way. You can of course stay with such simple approaches as checking CHR(0) $ charfield, but then you have nothing but visual inspection by browse to detect wrong logical field contents, unless you read in the DBF bytes 1:1. If you have a low number of corrupt records, the CHR(0) approach may be sufficient, as long as you have no DBF header defects and can still read the corrupt data.

And then you perhaps won't need to check other field types like dates or numeric fields, because they would seldom contain the only defect bytes; a defect in a record will mostly also show in a char field. We don't know. You decide on your own, and you can complete your solution so far or start with the newer ideas. You have many ingredients and ideas now, but completing this is beyond what a forum thread is for.
Your initial question is answered, I think, and we identified the problem. At this point you can finish your own code or begin a new analysis tool. I would also suggest that if you have further detail questions, you start a new thread now.

RE: Function to Detect Non Ascii Characters Needed (Programmer).

Thanks to ALL for your good, detailed explanations in solving this problem, and for the low-level table troubleshooting info and ideas that I'll be experimenting with today. One thing I did notice was that all records that had logical corruption also had char corruption, and in almost all cases the record had no useful data to determine what process created it.

Of course, after doing a manual inspection of said record, if there is enough data to recreate it, we will; otherwise we'll just delete it, as there is no way to get the correct data for that record. That would also be true even if we took the low-level approach, as the data is corrupt. Because its frequency is so low, I'm (1) logging it, (2) replacing the chars with good ones to create a stable record, and (3) manually correcting the data or deleting the record based on the logs. Also note that when I look at these corrupted records in natural order, they appear to have been introduced into the table in the 2003 era, as suggested by the good records listed immediately before and after the corrupted ones. I've not seen any issues in data created since then. And yes, I have enough info on solving this thread's original mission; thanks to all.

I consider this thread closed now. Any other comments or questions will start a new thread. Again, thanks, Stanley

RE: Function to Detect Non Ascii Characters Needed (Programmer) 8 Oct 12 06:26.